This is an extensive Exploratory Data Analysis for the Recruit Restaurant Visitor Forecasting competition with the powers of tidy R and ggplot2. We have also translated the main R code to Python.
The aim of this challenge is to predict the future numbers of restaurant visitors. This makes it a Time Series Forecasting problem. The data was collected from Japanese restaurants. As we will see, the data set is small and easily accessible without requiring much memory or computing power. Therefore, this competition is particularly suited for beginners.
The data comes in the shape of 8 relational files which are derived from two separate Japanese websites that collect user information: “Hot Pepper Gourmet (hpg): similar to Yelp” (search and reserve) and “AirREGI / Restaurant Board (air): similar to Square” (reservation control and cash register). The training data is based on the time range of Jan 2016 - most of Apr 2017, while the test set includes the last week of Apr plus May 2017. The test data intentionally spans a holiday week in Japan called the “Golden Week”. The data description further notes that: “There are days in the test set where the restaurants were closed and had no visitors. These are ignored in scoring. The training set omits days where the restaurants were closed.”
Those are the individual files:
air_visit_data.csv: historical visit data for the air restaurants. This is essentially the main training data set.
air_reserve.csv / hpg_reserve.csv: reservations made through the air / hpg systems.
air_store_info.csv / hpg_store_info.csv: details about the air / hpg restaurants including genre and location.
store_id_relation.csv: connects the air and hpg ids.
date_info.csv: essentially flags the Japanese holidays.
sample_submission.csv: serves as the test set. The id is formed by combining the air id with the visit date.
Thank you all for the upvotes, critical feedback, and continued support!
# Set python environment and version in RStudio ;-)
reticulate::use_python("/usr/bin/python3.10", required = TRUE)
reticulate::py_config()
## python: /usr/bin/python3.10
## libpython: /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so
## pythonhome: //usr://usr
## version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
## numpy: /home/kirus/.local/lib/python3.10/site-packages/numpy
## numpy_version: 1.26.0
##
## NOTE: Python version was forced by use_python() function
## 'en_US.utf8'
## [1] "/home/kirus/R/x86_64-pc-linux-gnu-library/4.3"
## [2] "/usr/local/lib/R/site-library"
## [3] "/usr/lib/R/site-library"
## [4] "/usr/lib/R/library"
## /usr/lib/R
# os.environ['R_HOME'] = "/home/kirus/R/x86_64-pc-linux-gnu-library/4.3"
# print(os.environ['R_HOME'])

We need to set the R library paths when R is called from Python. Two paths will be used:

/usr/lib/R/library for packages installed with the root profile when installing base R,

/home/kirus/R/x86_64-pc-linux-gnu-library/4.3 for packages installed by the user without the root profile.
import rpy2.rinterface
#rpy2.rinterface.set_initoptions((b'rpy2', b'--no-save', b'--no-restore', b'--quiet'))
from rpy2.robjects.packages import importr
## /home/kirus/.local/lib/python3.10/site-packages/rpy2/rinterface_lib/embedded.py:276: UserWarning: R was initialized outside of rpy2 (R_NilValue != NULL). Trying to use it nevertheless.
## warnings.warn(msg)
## R was initialized outside of rpy2 (R_NilValue != NULL). Trying to use it nevertheless.
We load a range of libraries for general data wrangling and general visualisation together with more specialised tools.
# general visualisation
library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('gridExtra') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
# general data manipulation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('DT') # display table
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
# specific visualisation
library('ggrepel') # visualisation
library('ggridges') # visualisation
library('ggExtra') # visualisation
library('ggforce') # visualisation
library('viridis') # visualisation
# specific data manipulation
library('lazyeval') # data wrangling
library('broom') # data wrangling
library('purrr') # string manipulation
# Date plus forecast
library('lubridate') # date and time
library('timeDate') # date and time
library('tseries') # time series analysis
library('forecast') # time series analysis
#library('prophet') # time series analysis
library('timetk') # time series analysis
# Maps / geospatial
library('geosphere') # geospatial locations
library('leaflet') # maps
library('leaflet.extras') # maps
library('maps') # maps
library(reticulate) # interface to python

#pip3 install matplotlib
#pip3 install seaborn
#pip3 install scikit-learn
#pip3 install rpy2
# pip install git+https://github.com/h2oai/datatable
# pip install ipywidgets==7.0.1
from rpy2 import robjects
from rpy2.robjects.packages import importr, data
import rpy2.robjects.packages as rpackages
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
import matplotlib
from matplotlib import pyplot as plt, dates
#import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import locale
from statsmodels.nonparametric.smoothers_lowess import lowess
from matplotlib.gridspec import GridSpec
import matplotlib.dates as mdates
from matplotlib.ticker import FormatStrFormatter
import folium
from folium.plugins import MarkerCluster
locale.setlocale(locale.LC_ALL,'en_US.utf8')
## 'en_US.utf8'
#from IPython.display import display, HTML # displays html table in Rmarkdown
#import ipywidgets
#import qgrid # display dataframe like DT::datatable in Rmarkdown document
#import datatable as dt
#from sklearn.model_selection import train_test_split
from collections import Counter
from datetime import datetime

## Load iris from python
#iris = sns.load_dataset('iris')
# load iris from R
datasets = rpackages.importr('datasets',
lib_loc= '/usr/lib/R/library/')
iris = data(datasets).fetch('iris')['iris']
# convert R dataframe to pandas df
# (https://rpy2.github.io/doc/latest/html/generated_rst/pandas.html)
with (ro.default_converter + pandas2ri.converter).context():
    iris = ro.conversion.get_conversion().rpy2py(iris)
#sns.set_style("darkgrid")
plt.style.use('ggplot')
#fi = sns.FacetGrid(iris, hue ="Species",
# height = 6).map(plt.scatter,
# 'Petal.Length',
# 'Petal.Width').add_legend()
fi=sns.relplot(x="Petal.Length",
y="Petal.Width",
data=iris, alpha=0.8,
hue='Species')
fi.fig.set_figheight(4)
fi.fig.set_figwidth(8)
plt.show()

We use the multiplot function, courtesy of R Cookbooks, to create multi-panel plots. We also make use of a brief helper function to compute binomial confidence intervals.
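The binomial confidence interval helper itself is not shown in this notebook; a minimal Python sketch of such a helper, here assuming the Wilson interval from statsmodels (our choice, not necessarily the original's), could look like:

```python
from statsmodels.stats.proportion import proportion_confint

def binom_ci(successes, trials, alpha=0.05):
    """Return a (lower, upper) Wilson confidence interval for a proportion."""
    return proportion_confint(successes, trials, alpha=alpha, method="wilson")

# e.g. 35 successes out of 100 trials
low, high = binom_ci(35, 100)
```

The Wilson method behaves better than the normal approximation for small counts, which is why it is a common default.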
# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}

We use matplotlib to create subplots. The function multiplot now
works with *args to accept an arbitrary number of plots,
and the cols argument determines the number of columns in the layout. If
layout is not provided, it is calculated from cols and the number of
plots.
The example at the end demonstrates how to use the multiplot function with two plots (y1 and y2) arranged in two columns.
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
def multiplot(*args, plotlist=None, file=None, cols=1, layout=None):
    plots = list(args) + (plotlist if plotlist else [])
    num_plots = len(plots)
    # number of rows needed, calculated from the number of columns
    rows = (num_plots - 1) // cols + 1
    if layout is None:
        layout = [(i // cols, i % cols) for i in range(num_plots)]
    fig = plt.figure()
    gs = GridSpec(rows, cols)
    # place each plot at its (row, col) position in the grid
    for i, plot in enumerate(plots):
        ax = fig.add_subplot(gs[layout[i]])
        ax.plot(plot)
    if file:
        plt.savefig(file)
    else:
        plt.show()
# Example usage
#import numpy as np
#x = np.linspace(0, 10, 100)
#y1 = np.sin(x)
#y2 = np.cos(x)
#multiplot(y1, y2, cols=2)

Another version uses add_gridspec to create a layout for multiple plots:
def multiplot(*args, plotlist=None, file=None, cols=1, layout=None):
    plots = list(args) + (plotlist if plotlist else [])
    num_plots = len(plots)
    # number of rows needed, calculated from the number of columns
    rows = (num_plots - 1) // cols + 1
    fig = plt.figure()
    # add_gridspec attaches the grid layout directly to the figure
    gs = fig.add_gridspec(rows, cols)
    for i, plot in enumerate(plots):
        ax = fig.add_subplot(gs[i // cols, i % cols])
        ax.plot(plot)
    if file:
        plt.savefig(file)
    else:
        plt.show()

air_visits <- data.table::fread(str_c("data/air_visit_data.csv"))
air_reserve <- data.table::fread(str_c('data/air_reserve.csv'))
air_store <- data.table::fread(str_c('data/air_store_info.csv'))
holidays <- data.table::fread(str_c('data/date_info.csv'))
hpg_reserve <- data.table::fread(str_c('data/hpg_reserve.csv'))
hpg_store <- data.table::fread(str_c('data/hpg_store_info.csv'))
store_ids <- data.table::fread(str_c('data/store_id_relation.csv'))
test <- data.table::fread(str_c('data/sample_submission.csv'))

In this Python code, we use the pandas library to read the CSV files. Each read_csv call reads a CSV file into a DataFrame. The variable names (air_visits, air_reserve, etc.) will be DataFrames that you can use for data manipulation and analysis.
air_visits = pd.read_csv("data/air_visit_data.csv")
air_reserve = pd.read_csv("data/air_reserve.csv")
air_store = pd.read_csv("data/air_store_info.csv")
holidays = pd.read_csv("data/date_info.csv")
hpg_reserve = pd.read_csv("data/hpg_reserve.csv")
hpg_store = pd.read_csv("data/hpg_store_info.csv")
store_ids = pd.read_csv("data/store_id_relation.csv")
test = pd.read_csv("data/sample_submission.csv")

As a first step let's have an overview of the data sets.
We find that this file contains the visitors numbers for each visit_date and air_store_id. The date feature should be transformed into a time-series format. There are 829 different stores, which is a small data set:
## [1] 829
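In Python, the same count can be obtained with pandas' nunique(); the tiny frame below is a made-up stand-in for air_visits, used only to illustrate the call:

```python
import pandas as pd

# a tiny stand-in for air_visits with two distinct stores
air_visits_demo = pd.DataFrame({
    "air_store_id": ["air_a", "air_a", "air_b"],
    "visit_date": ["2016-01-13", "2016-01-14", "2016-01-13"],
    "visitors": [25, 32, 29],
})

# number of distinct restaurants in the visit data
n_stores = air_visits_demo["air_store_id"].nunique()
print(n_stores)  # → 2
```

On the real air_visit_data.csv this returns 829.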
The qgrid and ipywidgets versions are not compatible and have an issue rendering HTML tables like DT::datatable().
#air_visits = pd.DataFrame(air_visits)
# Show the DataFrame using qgrid in Python
#qgrid_widget = qgrid.show_grid(air_visits, show_toolbar=True)
#qgrid_widget.head(10)
#html = air_visits.head(10).to_html()
#display(HTML(html))
#print(html)
air_visits.head(10)
## air_store_id visit_date visitors
## 0 air_ba937bf13d40fb24 2016-01-13 25
## 1 air_ba937bf13d40fb24 2016-01-14 32
## 2 air_ba937bf13d40fb24 2016-01-15 29
## 3 air_ba937bf13d40fb24 2016-01-16 22
## 4 air_ba937bf13d40fb24 2016-01-18 6
## 5 air_ba937bf13d40fb24 2016-01-19 9
## 6 air_ba937bf13d40fb24 2016-01-20 31
## 7 air_ba937bf13d40fb24 2016-01-21 21
## 8 air_ba937bf13d40fb24 2016-01-22 18
## 9 air_ba937bf13d40fb24 2016-01-23 26
## 829
We find that the air reservations include the date and time of the reservation, as well as those of the visit. We have reservation numbers for 314 air stores:
## [1] 314
## air_store_id ... reserve_visitors
## 0 air_877f79706adbfb06 ... 1
## 1 air_db4b38ebe7a7ceff ... 3
## 2 air_db4b38ebe7a7ceff ... 6
## 3 air_877f79706adbfb06 ... 2
## 4 air_db80363d35f10926 ... 5
## 5 air_db80363d35f10926 ... 2
## 6 air_db80363d35f10926 ... 4
## 7 air_3bb99a1fe0583897 ... 2
## 8 air_3bb99a1fe0583897 ... 2
## 9 air_2b8b29ddfd35018e ... 2
##
## [10 rows x 4 columns]
## 314
The hpg reservations file follows the same structure as the corresponding air file. We have reservation numbers for 13325 hpg stores:
## [1] 13325
## hpg_store_id ... reserve_visitors
## 0 hpg_c63f6f42e088e50f ... 1
## 1 hpg_dac72789163a3f47 ... 3
## 2 hpg_c8e24dcf51ca1eb5 ... 2
## 3 hpg_24bb207e5fd49d4a ... 5
## 4 hpg_25291c542ebb3bc2 ... 13
## 5 hpg_28bdf7a336ec6a7b ... 2
## 6 hpg_2a01a042bca04ad9 ... 2
## 7 hpg_2a84dd9f4c140b82 ... 2
## 8 hpg_2ad179871696901f ... 2
## 9 hpg_2c1d989eedb0ff83 ... 6
##
## [10 rows x 4 columns]
## 13325
We find that the air_store info includes the name of the particular cuisine along with the name of the area.
## air_store_id air_genre_name ... latitude longitude
## 0 air_0f0cdeee6c9bf3d7 Italian/French ... 34.695124 135.197853
## 1 air_7cc17a324ae5c7dc Italian/French ... 34.695124 135.197853
## 2 air_fee8dcf4d619598e Italian/French ... 34.695124 135.197853
## 3 air_a17f0778617c76e2 Italian/French ... 34.695124 135.197853
## 4 air_83db5aff8f50478e Italian/French ... 35.658068 139.751599
## 5 air_99c3eae84130c1cb Italian/French ... 35.658068 139.751599
## 6 air_f183a514cb8ff4fa Italian/French ... 35.658068 139.751599
## 7 air_6b9fa44a9cf504a1 Italian/French ... 35.658068 139.751599
## 8 air_0919d54f0c9a24b8 Italian/French ... 35.658068 139.751599
## 9 air_2c6c79d597e48096 Italian/French ... 35.658068 139.751599
##
## [10 rows x 5 columns]
Again, the hpg_store info follows the same structure as the air info. Here the genre_name includes the word “style”. It's worth checking whether the same is true for the air data or whether it just refers to the specific “Japanese style”. There are 4690 different hpg_store_ids, which is significantly fewer than the number of stores we have reservation data for.
## hpg_store_id hpg_genre_name ... latitude longitude
## 0 hpg_6622b62385aec8bf Japanese style ... 35.643675 139.668221
## 1 hpg_e9e068dd49c5fa00 Japanese style ... 35.643675 139.668221
## 2 hpg_2976f7acb4b3a3bc Japanese style ... 35.643675 139.668221
## 3 hpg_e51a522e098f024c Japanese style ... 35.643675 139.668221
## 4 hpg_e3d0e1519894f275 Japanese style ... 35.643675 139.668221
## 5 hpg_530cd91db13b938e Japanese style ... 35.643675 139.668221
## 6 hpg_02457b318e186fa4 Japanese style ... 35.643675 139.668221
## 7 hpg_0cb3c2c490020a29 Japanese style ... 35.643675 139.668221
## 8 hpg_3efe9b08c887fe9a Japanese style ... 35.643675 139.668221
## 9 hpg_765e8d3ba261dc1c Japanese style ... 35.643675 139.668221
##
## [10 rows x 5 columns]
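The check suggested above, whether the word “style” also appears in the air genres, reduces to a simple string match. The genre columns below are synthetic stand-ins for the real store tables:

```python
import pandas as pd

# synthetic stand-ins for the hpg_store / air_store genre columns
hpg_genres = pd.Series(["Japanese style", "International cuisine", "Japanese style"])
air_genres = pd.Series(["Italian/French", "Izakaya", "Cafe/Sweets"])

# share of genre labels containing the word "style"
hpg_share = hpg_genres.str.contains("style").mean()
air_share = air_genres.str.contains("style").mean()
print(hpg_share, air_share)
```

On the real data this would reveal whether “style” is an hpg-only naming convention.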
We called the date_info file holidays, because that’s essentially the information it contains. Holidays are encoded as binary flags in integer format. This should become a logical binary feature for exploration purposes.
## calendar_date day_of_week holiday_flg
## 0 2016-01-01 Friday 1
## 1 2016-01-02 Saturday 1
## 2 2016-01-03 Sunday 1
## 3 2016-01-04 Monday 0
## 4 2016-01-05 Tuesday 0
## 5 2016-01-06 Wednesday 0
## 6 2016-01-07 Thursday 0
## 7 2016-01-08 Friday 0
## 8 2016-01-09 Saturday 0
## 9 2016-01-10 Sunday 0
This is a relational file that connects the air and hpg ids. There are only 150 pairs, which is less than 20% of all air stores.
## air_store_id hpg_store_id
## 0 air_63b13c56b7201bd9 hpg_4bc649e72e2a239a
## 1 air_a24bf50c3e90d583 hpg_c34b496d0305a809
## 2 air_c7f78b4f3cba33ff hpg_cd8ae0d9bbd58ff9
## 3 air_947eb2cae4f3e8f2 hpg_de24ea49dc25d6b8
## 4 air_965b2e0cf4119003 hpg_653238a84804d8e7
## 5 air_a38f25e3399d1b25 hpg_50378da9ffb9b6cd
## 6 air_3c938075889fc059 hpg_349b1b92f98b175e
## 7 air_68301bcb11e2f389 hpg_2c09f3abb2220659
## 8 air_5f6fa1b897fe80d5 hpg_40aff6385800ebb1
## 9 air_00a91d42b08b08d9 hpg_fbe603376b5980fc
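Once the data are loaded, this relation file can be used to map hpg information onto air stores with a pandas merge. The ids below are invented for illustration:

```python
import pandas as pd

# hypothetical id pairs mirroring store_id_relation.csv
store_ids = pd.DataFrame({
    "air_store_id": ["air_1", "air_2"],
    "hpg_store_id": ["hpg_9", "hpg_8"],
})
hpg_reserve = pd.DataFrame({
    "hpg_store_id": ["hpg_9", "hpg_9", "hpg_7"],
    "reserve_visitors": [2, 3, 5],
})

# keep only hpg reservations that can be mapped to an air store
mapped = hpg_reserve.merge(store_ids, on="hpg_store_id", how="inner")
print(len(mapped))  # → 2
```

Since only 150 pairs exist, an inner join like this drops most hpg reservations; whether to keep the unmapped ones is a modelling decision.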
As noted in the data description, the id of the final submission file is a concatenation of the air_id and the date.
## id visitors
## 0 air_00a91d42b08b08d9_2017-04-23 0
## 1 air_00a91d42b08b08d9_2017-04-24 0
## 2 air_00a91d42b08b08d9_2017-04-25 0
## 3 air_00a91d42b08b08d9_2017-04-26 0
## 4 air_00a91d42b08b08d9_2017-04-27 0
## 5 air_00a91d42b08b08d9_2017-04-28 0
## 6 air_00a91d42b08b08d9_2017-04-29 0
## 7 air_00a91d42b08b08d9_2017-04-30 0
## 8 air_00a91d42b08b08d9_2017-05-01 0
## 9 air_00a91d42b08b08d9_2017-05-02 0
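Because the visit date is always appended after the final underscore, the id can be split back into its store id and date parts, for example:

```python
import pandas as pd

# two example rows in the sample_submission format
test_demo = pd.DataFrame({
    "id": ["air_00a91d42b08b08d9_2017-04-23", "air_00a91d42b08b08d9_2017-04-24"],
    "visitors": [0, 0],
})

# rsplit on the last underscore separates the air id from the visit date
parts = test_demo["id"].str.rsplit("_", n=1, expand=True)
test_demo["air_store_id"] = parts[0]
test_demo["visit_date"] = pd.to_datetime(parts[1])
print(test_demo["air_store_id"].iloc[0])  # → air_00a91d42b08b08d9
```

A plain split on "_" would not work here, since the store id itself contains an underscore.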
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
There are no missing values in our data. Excellent.
In Python, you can use the isna() method from the pandas library to check for missing values in a DataFrame, and then use sum() to count them. Here’s the equivalent code:
## 0
## 0
## 0
## 0
## 0
## 0
## 0
test.isna() returns a DataFrame of the same shape as
test with True where the original DataFrame has missing values, and
False otherwise.
.sum() is used twice. The first sum() computes the
count of missing values for each column, and the second sum() adds up
those counts to get the total number of missing values in the entire
DataFrame.
The final result, missing_values_count, will be the equivalent of sum(is.na(test)) in R.
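Putting this together, the check can be applied to each data set in turn. The frames below are small stand-ins; in the notebook the loop would run over air_visits, air_reserve, and the other loaded DataFrames:

```python
import pandas as pd

# small stand-ins for the loaded DataFrames
frames = {
    "air_visits": pd.DataFrame({"visitors": [25, 32],
                                "visit_date": ["2016-01-13", "2016-01-14"]}),
    "test": pd.DataFrame({"id": ["air_x_2017-04-23"], "visitors": [0]}),
}

# total count of missing values per DataFrame
missing_counts = {name: int(df.isna().sum().sum()) for name, df in frames.items()}
print(missing_counts)  # → {'air_visits': 0, 'test': 0}
```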
We change the formatting of the date/time features and also reformat a few features to logical and factor variables for exploration purposes.
air_visits <- air_visits %>%
mutate(visit_date = ymd(visit_date))
air_reserve <- air_reserve %>%
mutate(visit_datetime = as.POSIXct(visit_datetime,format="%Y-%m-%dT%H:%MZ"), #ymd_hms(visit_datetime),
reserve_datetime = as.POSIXct(reserve_datetime,format="%Y-%m-%dT%H:%MZ")) #ymd_hms(reserve_datetime)
hpg_reserve <- hpg_reserve %>%
mutate(visit_datetime = as.POSIXct(visit_datetime,format="%Y-%m-%dT%H:%MZ"),
reserve_datetime = as.POSIXct(reserve_datetime,format="%Y-%m-%dT%H:%MZ"))
air_store <- air_store %>%
mutate(air_genre_name = as.factor(air_genre_name),
air_area_name = as.factor(air_area_name))
hpg_store <- hpg_store %>%
mutate(hpg_genre_name = as.factor(hpg_genre_name),
hpg_area_name = as.factor(hpg_area_name))
holidays <- holidays %>%
mutate(holiday_flg = as.logical(holiday_flg),
date = ymd(calendar_date),
calendar_date = as.character(calendar_date))

# Convert visit_date column to datetime format
air_visits['visit_date'] = pd.to_datetime(air_visits['visit_date'])
# Convert visit_datetime and reserve_datetime columns to datetime format
air_reserve['visit_datetime'] = pd.to_datetime(air_reserve['visit_datetime'])
air_reserve['reserve_datetime'] = pd.to_datetime(air_reserve['reserve_datetime'])
hpg_reserve['visit_datetime'] = pd.to_datetime(hpg_reserve['visit_datetime'])
hpg_reserve['reserve_datetime'] = pd.to_datetime(hpg_reserve['reserve_datetime'])
# Convert categorical columns to factors (categorical variables in pandas)
air_store['air_genre_name'] = air_store['air_genre_name'].astype('category')
air_store['air_area_name'] = air_store['air_area_name'].astype('category')
hpg_store['hpg_genre_name'] = hpg_store['hpg_genre_name'].astype('category')
hpg_store['hpg_area_name'] = hpg_store['hpg_area_name'].astype('category')
# Convert holiday_flg to boolean, date to datetime, and calendar_date to string
holidays['holiday_flg'] = holidays['holiday_flg'].astype(bool)
holidays['date'] = pd.to_datetime(holidays['calendar_date'])
holidays['calendar_date'] = holidays['calendar_date'].astype(str)

Here we have a first look at the distributions of the features in our individual data files before combining them for a more detailed analysis. This initial visualisation will be the foundation on which we build our analysis.
We start with the number of visits to the air restaurants. Here we plot the total number of visitors per day over the full training time range together with the median visitors per day of the week and month of the year:
Sys.setenv(LANG = "en_US.UTF-8")
p1 <- air_visits %>%
group_by(visit_date) %>%
summarise(all_visitors = sum(visitors)) %>%
ggplot(aes(visit_date,all_visitors)) +
geom_line(col = "blue") +
labs(y = "All visitors", x = "Date")
p2 <- air_visits %>%
ggplot(aes(visitors)) +
geom_vline(xintercept = 20, color = "orange") +
geom_histogram(fill = "blue", bins = 30) +
scale_x_log10()
p3 <- air_visits %>%
mutate(wday = wday(visit_date, label = TRUE, week_start = 1)) %>%
group_by(wday) %>%
summarise(visits = median(visitors)) %>%
ggplot(aes(wday, visits, fill = wday)) +
geom_col() +
theme(legend.position = "none", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
labs(x = "Day of the week", y = "Median visitors") +
scale_fill_hue()
p4 <- air_visits %>%
mutate(month = month(visit_date, label = TRUE)) %>%
group_by(month) %>%
summarise(visits = median(visitors)) %>%
ggplot(aes(month, visits, fill = month)) +
geom_col() +
theme(legend.position = "none", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
labs(x = "Month", y = "Median visitors")
layout <- matrix(c(1,1,1,1,2,3,4,4),2,4,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)

We find:

* There is an interesting long-term step structure in the overall time series. This might be related to new restaurants being added to the data base. In addition, we already see a periodic pattern that most likely corresponds to a weekly cycle.

* The number of guests per visit per restaurant per day peaks at around 20 (the orange line). The distribution extends up to 100 and, in rare cases, beyond.

* Friday and the weekend appear to be the most popular days, which is to be expected. Monday and Tuesday have the lowest numbers of average visitors.

* There is also a certain amount of variation during the year. December appears to be the most popular month for restaurant visits. The period of March to May is consistently busy.
We will be forecasting for the last week of April plus May 2017, so let’s look at this time range in our 2016 training data:
air_visits %>%
filter(visit_date > ymd("2016-04-15") & visit_date < ymd("2016-06-15")) %>%
group_by(visit_date) %>%
summarise(all_visitors = sum(visitors)) %>%
ggplot(aes(visit_date,all_visitors)) +
geom_line() +
geom_smooth(method = "loess", color = "blue", span = 1/7) +
labs(y = "All visitors", x = "Date")
## `geom_smooth()` using formula = 'y ~ x'

Here, the black line shows the data and the blue line corresponds to a smoothing fit with a corresponding grey confidence area. We see again the weekly period and also the impact of the aforementioned Golden Week, which in 2016 happened between Apr 29 and May 5.
Seaborn doesn’t have a direct equivalent to the
geom_smooth function with the loess method. However, we can
achieve a similar effect using the seaborn.lineplot function combined
with a lowess smoother from the statsmodels
library.
## 'en_US.utf8'
# Set the language to English
#plt.rcParams['axes.formatter.use_locale'] = False
#plt.rcParams['pgf.rcfonts'] = False
#plt.rcParams['pdf.fonttype'] = 42
p1_data = air_visits.groupby('visit_date')['visitors'].sum().reset_index()
p3_data = air_visits.assign(wday=air_visits['visit_date'].dt.day_name()).groupby('wday')['visitors'].median().reset_index()
p4_data = air_visits.assign(month=air_visits['visit_date'].dt.strftime('%b')).groupby('month')['visitors'].median().reset_index()
plt.style.use('ggplot')
fig = plt.figure(figsize=(11, 6))
gs = fig.add_gridspec(2, 3)
# First row
ax1 = fig.add_subplot(gs[0, :])
sns.lineplot(data=p1_data, x='visit_date', y='visitors', color='blue', ax=ax1)
ax1.set_xlabel('All visitors')
ax1.set_ylabel('Date')
# Second row
ax2 = fig.add_subplot(gs[1, 0])
plt.axvline(x=20, color='orange')
plt.xscale('log')
sns.histplot(data=air_visits, x='visitors', bins=30, color='blue', kde=True, ax=ax2)
ax3 = fig.add_subplot(gs[1, 1])
order = [ "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
ax3.set_xticklabels( ('Mon', 'Tue','Wed', 'Thur', 'Fri', 'Sat', 'Sun') )
p3 = sns.barplot(data=p3_data, x='wday', y='visitors', hue='wday',
legend=False, ax=ax3, order=order) #palette='viridis',
p3 = plt.xticks(rotation=45)
#ax3.set_xticklabels(ax3.get_xticklabels(), rotation=45)
ax3.set_xlabel('Days')
ax4 = fig.add_subplot(gs[1, 2])
order = [ "Jan", "Feb", "Mar", "Apr", "May", "Jun","Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
p4 = sns.barplot(data=p4_data, x='month', y='visitors', hue='month',
legend=False, ax=ax4, order=order, palette='viridis')
p4 = plt.xticks(rotation=45)
#ax4.set_xticklabels(ax4.get_xticklabels(), rotation=45)
ax4.set_xlabel('Months')
plt.tight_layout()
plt.show()

In this code, we use add_gridspec to create a 2x3 grid of subplots.
Then, we manually add subplots to specific positions using add_subplot.
This way, you have full control over the layout of the plots.
from statsmodels.nonparametric.smoothers_lowess import lowess
import matplotlib.dates as mdates
#locale.setlocale(locale.LC_ALL,'en_US.utf8')
plt.style.use('ggplot')
# Assuming air_visits is your DataFrame
subset_data = air_visits[(air_visits['visit_date'] > '2016-04-15') &
(air_visits['visit_date'] < '2016-06-15')]
# Calculate the total visitors for each visit_date
agg_data = subset_data.groupby('visit_date')['visitors'].sum().reset_index()
#agg_data.visit_date = agg_data.visit_date.dt.strftime('%d %b')
# build the figure
fig, ax = plt.subplots()
# Scatter plot of the data
p = sns.lineplot(data = agg_data, x='visit_date', y='visitors',
color='black', label='Data')
#p= sns.regplot(data=agg_data, x='visit_date', y='visitors',
# color='black', lowess=True, ax=plt.gca())
# set dates format
p = ax.set(xticklabels = agg_data['visit_date'].dt.strftime('%d %b'))
## <string>:5: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
# shade an illustrative +/-20% band around the data
# (not a true LOWESS confidence interval)
p = plt.fill_between(agg_data.visit_date, agg_data.visitors*0.8,
                     agg_data.visitors*1.2,
                     alpha=0.25, label='LOWESS uncertainty',
                     color = "grey")
# put the labels at 45deg since they tend to be too long
#p = plt.xticks(rotation=45)
fig.autofmt_xdate()
# Ensure that the x-axis labels are well spaced
p = ax.xaxis.set_major_locator(mdates.AutoDateLocator())
## Calculate the lowess smoothed line
smoothed = lowess(agg_data['visitors'],
agg_data['visit_date'].astype('datetime64[s]'), frac=1/7)
## Plot the smoothed line
p = plt.plot(agg_data['visit_date'], smoothed[:, 1],
color='blue', label='Lowess Smoother')
# Set labels and legend
p = plt.xlabel('Date')
p = plt.ylabel('All visitors')
plt.legend("")
# Show the plot
plt.show()

Let's see how our reservations data compares to the actual visitor numbers. We start with the air restaurants and visualise their visitor volume through reservations for each day, alongside the hours of these visits and the time between making a reservation and visiting the restaurant:
foo <- air_reserve %>%
mutate(reserve_date = date(reserve_datetime),
reserve_hour = hour(reserve_datetime),
reserve_wday = wday(reserve_datetime, label = TRUE, week_start = 1),
visit_date = date(visit_datetime),
visit_hour = hour(visit_datetime),
visit_wday = wday(visit_datetime, label = TRUE, week_start = 1),
diff_hour = time_length(visit_datetime - reserve_datetime, unit = "hour"),
diff_day = time_length(visit_datetime - reserve_datetime, unit = "day")
)
p1 <- foo %>%
group_by(visit_date) %>%
summarise(all_visitors = sum(reserve_visitors)) %>%
ggplot(aes(visit_date, all_visitors)) +
geom_line() +
labs(x = "Air visit date")
p2 <- foo %>%
group_by(visit_hour) %>%
summarise(all_visitors = sum(reserve_visitors)) %>%
ggplot(aes(visit_hour, all_visitors)) +
geom_col(fill = "blue")
p3 <- foo %>%
filter(diff_hour < 24*5) %>%
group_by(diff_hour) %>%
summarise(all_visitors = sum(reserve_visitors)) %>%
ggplot(aes(diff_hour, all_visitors)) +
geom_col(fill = "blue") +
labs(x = "Time from reservation to visit (hours)")
layout <- matrix(c(1,1,2,3),2,2, byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)

# Assuming 'air_reserve' is your DataFrame
foo = air_reserve.copy()
foo['reserve_date'] = foo['reserve_datetime'].dt.date
foo['reserve_hour'] = foo['reserve_datetime'].dt.hour
foo['reserve_wday'] = foo['reserve_datetime'].dt.day_name()
foo['visit_date'] = foo['visit_datetime'].dt.date
foo['visit_hour'] = foo['visit_datetime'].dt.hour
foo['visit_wday'] = foo['visit_datetime'].dt.day_name()
foo['diff_hour'] = (foo['visit_datetime'] - foo['reserve_datetime']).dt.total_seconds() / 3600
foo['diff_day'] = (foo['visit_datetime'] - foo['reserve_datetime']).dt.days
p1 = foo.groupby('visit_date')['reserve_visitors'].sum().reset_index()
p2 = foo.groupby('visit_hour')['reserve_visitors'].sum().reset_index()
p3 = foo[foo['diff_hour'] < 24*5].groupby('diff_hour')['reserve_visitors'].sum().reset_index()
# Multiplot layout
fig = plt.figure(figsize=(9, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')
ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, 0])
ax3 = fig.add_subplot(gs[1, 1])
p1.plot(x='visit_date', y='reserve_visitors', kind='line', ax=ax1, color = 'black')
p1= ax1.set_xlabel('air visit date')
p1= plt.legend("")
p2.plot(x='visit_hour', y='reserve_visitors', kind='bar', color='blue', ax=ax2)
p2= ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
p2= ax2.xaxis.set_major_locator(mdates.AutoDateLocator())
p2= plt.legend("")
p3.plot(x='diff_hour', y='reserve_visitors', kind='bar', color='blue', ax=ax3)
p3= ax3.set_xlabel('Time from reservation to visit (hours)')
p3= ax3.set_xticklabels(ax3.get_xticklabels(), rotation=0)
p3= ax3.xaxis.set_major_locator(mdates.AutoDateLocator())
p3= ax3.xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
p3= plt.legend("")
plt.tight_layout()
plt.show()

In the same style as above, here are the hpg reservations:
foo <- hpg_reserve %>%
mutate(reserve_date = date(reserve_datetime),
reserve_hour = hour(reserve_datetime),
visit_date = date(visit_datetime),
visit_hour = hour(visit_datetime),
diff_hour = time_length(visit_datetime - reserve_datetime, unit = "hour"),
diff_day = time_length(visit_datetime - reserve_datetime, unit = "day")
)
p1 <- foo %>%
group_by(visit_date) %>%
summarise(all_visitors = sum(reserve_visitors)) %>%
ggplot(aes(visit_date, all_visitors)) +
geom_line() +
labs(x = "HPG visit date")
p2 <- foo %>%
group_by(visit_hour) %>%
summarise(all_visitors = sum(reserve_visitors)) %>%
ggplot(aes(visit_hour, all_visitors)) +
geom_col(fill = "red")
p3 <- foo %>%
filter(diff_hour < 24*5) %>%
group_by(diff_hour) %>%
summarise(all_visitors = sum(reserve_visitors)) %>%
ggplot(aes(diff_hour, all_visitors)) +
geom_col(fill = "red") +
labs(x = "Time from reservation to visit (hours)")
layout <- matrix(c(1,1,2,3),2,2,byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)

We find:
* Here the visits after reservation follow a more orderly pattern, with a clear spike in Dec 2016. As above for the air data, we also see reservation visits dropping off as we get closer to the end of the time frame.

* Again, most reservations are for dinner, and we see another nice 24-hour pattern for making these reservations. It's worth noting that here the last few hours before the visit don't see more volume than the 24 or 48 hours before. This is in stark contrast to the air data.
foo = hpg_reserve.copy()
foo['reserve_date'] = foo['reserve_datetime'].dt.date
foo['reserve_hour'] = foo['reserve_datetime'].dt.hour
foo['visit_date'] = foo['visit_datetime'].dt.date
foo['visit_hour'] = foo['visit_datetime'].dt.hour
foo['diff_hour'] = (foo['visit_datetime'] - foo['reserve_datetime']).dt.total_seconds() / 3600
foo['diff_day'] = (foo['visit_datetime'] - foo['reserve_datetime']).dt.days
p1 = foo.groupby('visit_date')['reserve_visitors'].sum().reset_index()
p2 = foo.groupby('visit_hour')['reserve_visitors'].sum().reset_index()
p3 = foo[foo['diff_hour'] < 24*5].groupby('diff_hour')['reserve_visitors'].sum().reset_index()
# Multiplot layout
fig = plt.figure(figsize=(9, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')
ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, 0])
ax3 = fig.add_subplot(gs[1, 1])
p1.plot(x='visit_date', y='reserve_visitors', kind='line', color='black', ax=ax1)
p1= ax1.set_xlabel('HPG visit date')
#p1= ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)
p1= ax1.xaxis.set_major_locator(mdates.AutoDateLocator())
p1= plt.legend("")
p2.plot(x='visit_hour', y='reserve_visitors', kind='bar', color='red', ax=ax2)
p2= ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
p2= ax2.xaxis.set_major_locator(mdates.AutoDateLocator())
p2= plt.legend("")
p3.plot(x='diff_hour', y='reserve_visitors', kind='bar', color='red', ax=ax3)
p3= ax3.set_xlabel('Time from reservation to visit (hours)')
p3= ax3.set_xticklabels(ax3.get_xticklabels(), rotation=0)
p3= ax3.xaxis.set_major_locator(mdates.AutoDateLocator())
p3= ax3.xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
p3= plt.legend("")
plt.tight_layout()
plt.show()
After visualising the temporal aspects, let’s now look at the spatial information. Note that, according to the data description, the latitude and longitude are those of the area to which the store belongs. This is meant to discourage us from identifying specific restaurants. I would be surprised if nobody tried anyway, though.
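One consequence of the area-level coordinates is that many stores share exactly the same latitude/longitude pair. A quick pandas sketch of how to count them (store ids and coordinates below are made up):

```python
import pandas as pd

# Synthetic stand-in for air_store: ids are invented, coordinates mimic area-level values
stores = pd.DataFrame({
    "air_store_id": ["air_a", "air_b", "air_c", "air_d"],
    "latitude":  [35.66, 35.66, 35.66, 43.06],
    "longitude": [139.67, 139.67, 139.67, 141.35],
})
# Count how many stores share each (latitude, longitude) pair
per_coord = stores.groupby(["latitude", "longitude"]).size()
print(per_coord.max())  # 3 stores share the busiest area coordinate
```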
This is a fully interactive and zoomable map of all the air restaurants. Click on the clusters to break them up into smaller clusters and ultimately into the individual restaurants, which are labelled by their genre. The map nicely visualises the fact that many restaurants share common coordinates, since those coordinates refer to the area of the restaurant. Click on the single markers to see their air_store_id. The map is powered by the leaflet package, which includes a variety of cool tools for interactive maps. Have fun exploring!
leaflet(air_store) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
addMarkers(~longitude, ~latitude,
popup = ~air_store_id, label = ~air_genre_name,
clusterOptions = markerClusterOptions())
Next, we plot the numbers of different types of cuisine (or air_genre_names) alongside the areas with the most air restaurants:
p1 <- air_store %>%
group_by(air_genre_name) %>%
count() %>%
ggplot(aes(reorder(air_genre_name, n, FUN = min), n, fill = air_genre_name)) +
geom_col() +
coord_flip() +
theme(legend.position = "none") +
labs(x = "Type of cuisine (air_genre_name)", y = "Number of air restaurants")
p2 <- air_store %>%
group_by(air_area_name) %>%
count() %>%
ungroup() %>%
top_n(15,n) %>%
ggplot(aes(reorder(air_area_name, n, FUN = min) ,n, fill = air_area_name)) +
geom_col() +
theme(legend.position = "none") +
coord_flip() +
labs(x = "Top 15 areas (air_area_name)", y = "Number of air restaurants")
layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)
We find:
There are lots of Izakaya gastropubs in our data, followed by Cafes. We don’t have many Karaoke places in the air data set and also only a few that describe themselves as generically International or Asian. I have to admit, I’m kind of intrigued by creative cuisine.
Fukuoka has the largest number of air restaurants per area, followed by many Tokyo areas.
In Python, you can achieve a similar map using the folium library.
import warnings
warnings.filterwarnings('ignore')
# Create a map centered at a specific location
m = folium.Map(location=[35.6528, 139.8395], zoom_start=7, width=1500, height=800)
# Add tiles (you can choose different providers)
_ = folium.TileLayer(tiles='CartoDB positron').add_to(m)
# Create a MarkerCluster object
marker_cluster = MarkerCluster().add_to(m)
## Add markers from the air_store dataset
# for index, row in air_store.iterrows():
# folium.Marker(location=[row['latitude'], row['longitude']],
# popup=row['air_store_id'],
# icon=None,
# tooltip=row['air_genre_name']).add_to(marker_cluster)
# Define a function to add markers
def add_marker(row):
folium.Marker(location=[row['latitude'], row['longitude']],
popup=row['air_store_id'],
icon=None,
tooltip=row['air_genre_name']).add_to(marker_cluster)
# Apply the function to each row of the DataFrame
_ = air_store.apply(add_marker, axis=1)
# Display the map
m
p1_data = air_store['air_genre_name'].value_counts().reset_index()
p1_data.columns = ['air_genre_name', 'count']
p1_data = p1_data.sort_values(by=['count'], ascending=False).reset_index(drop=True)
p2_data = air_store['air_area_name'].value_counts().nlargest(15).reset_index()
p2_data.columns = ['air_area_name', 'count']
p2_data = p2_data.sort_values(['count'], ascending=False).reset_index(drop=True)
# Multiplot layout
fig = plt.figure(figsize=(10, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')
ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, :])
# Plot 1
p1= sns.barplot(data=p1_data, x='count', y='air_genre_name', order=p1_data['air_genre_name'],
palette='dark', ax=ax1, legend=False)
p1= ax1.set_xlabel('Number of air restaurants')
p1= ax1.set_ylabel("Type of cuisine")
#p1= ax1.set_title("Number of air restaurants by cuisine type")
#p1= ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)
#p1= ax1.yaxis.set_major_locator(mdates.AutoDateLocator())
#p1= plt.legend("")
# Plot 2
p2= sns.barplot(data=p2_data, x='count', y='air_area_name', order=p2_data['air_area_name'],
palette='bright', ax=ax2, legend=False, width=0.8)
p2= ax2.set_xlabel("Number of air restaurants")
p2= ax2.set_ylabel("Top 15 areas")
#p2= ax2.set_title("Top 15 areas with most air restaurants")
#p2= ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
#p2= ax2.yaxis.set_major_locator(mdates.AutoDateLocator())
p2= plt.legend("")
plt.tight_layout()
plt.show()
In the same way as for the air stores above, we also create an interactive map for the different hpg restaurants:
leaflet(hpg_store) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
addMarkers(~longitude, ~latitude,
popup = ~hpg_store_id, label = ~hpg_genre_name,
clusterOptions = markerClusterOptions())
Here is the breakdown of genre and area for the hpg restaurants:
p1 <- hpg_store %>%
group_by(hpg_genre_name) %>%
count() %>%
ggplot(aes(reorder(hpg_genre_name, n, FUN = min), n, fill = hpg_genre_name)) +
geom_col() +
coord_flip() +
theme(legend.position = "none") +
labs(x = "Type of cuisine (hpg_genre_name)", y = "Number of hpg restaurants")
p2 <- hpg_store %>%
mutate(area = str_sub(hpg_area_name, 1, 20)) %>%
group_by(area) %>%
count() %>%
ungroup() %>%
top_n(15,n) %>%
ggplot(aes(reorder(area, n, FUN = min) ,n, fill = area)) +
geom_col() +
theme(legend.position = "none") +
coord_flip() +
labs(x = "Top 15 areas (hpg_area_name)", y = "Number of hpg restaurants")
layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Here we have truncated the hpg_area_name to 20 characters to make the plot more readable.
We find:
The hpg description contains a larger variety of genres than the air data. Here, Japanese style appears to contain many more places that are categorised more specifically in the air data. The same applies to International cuisine.
Among the top 15 areas we again find Tokyo and Osaka prominently represented.
# Create a map centered at a specific location
m = folium.Map(location=[35.6528, 139.8395], zoom_start=7, width=1500, height=800)
# Add tiles (you can choose different providers)
_ = folium.TileLayer(tiles='CartoDB positron').add_to(m)
# Create a MarkerCluster object
marker_cluster = MarkerCluster().add_to(m)
## Also working
# hpg_store.apply(lambda row:folium.Marker(
# location=[row['latitude'], row['longitude']],
# popup=row['hpg_store_id'],icon=None,
# tooltip=row['hpg_genre_name']).add_to(marker_cluster),
# axis=1)
# Define a function to add markers
def add_marker(row):
folium.Marker(location=[row['latitude'], row['longitude']],
popup=row['hpg_store_id'],
icon=None,
tooltip=row['hpg_genre_name']).add_to(marker_cluster)
# Apply the function to each row of the DataFrame
_ = hpg_store.apply(add_marker, axis=1)
# Display the map
m
p1_data = hpg_store['hpg_genre_name'].value_counts().reset_index()
p1_data.columns = ['hpg_genre_name', 'count']
p1_data = p1_data.sort_values(by=['count'], ascending=False).reset_index(drop=True)
p2_data = hpg_store['hpg_area_name'].value_counts().nlargest(15).reset_index()
p2_data.columns = ['hpg_area_name', 'count']
p2_data = p2_data.sort_values(['count'], ascending=False).reset_index(drop=True)
# Multiplot layout
fig = plt.figure(figsize=(12, 8))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')
ax1 = fig.add_subplot(gs[:, 0])
ax2 = fig.add_subplot(gs[:, 1])
# Plot 1
p1= sns.barplot(data=p1_data, x='count', y='hpg_genre_name', order=p1_data['hpg_genre_name'],
palette='dark', ax=ax1, legend=False)
p1= ax1.set_xlabel('Number of hpg restaurants')
p1= ax1.set_ylabel("Type of cuisine")
#p1= ax1.set_title("Number of hpg restaurants by cuisine type")
#p1= ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)
#p1= ax1.yaxis.set_major_locator(mdates.AutoDateLocator())
#p1= plt.legend("")
# Plot 2
p2= sns.barplot(data=p2_data, x='count', y='hpg_area_name', order=p2_data['hpg_area_name'],
palette='bright', ax=ax2, legend=False, width=0.8)
p2= ax2.set_xlabel("Number of hpg restaurants")
p2= ax2.set_ylabel("Top 15 areas")
#p2= ax2.set_title("Top 10 areas with most hpg restaurants")
#p2= ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
#p2= ax2.yaxis.set_major_locator(mdates.AutoDateLocator())
p2= plt.legend("")
plt.tight_layout()
plt.show()
Let’s have a quick look at the holidays. We’ll plot how many there are in total and also how they are distributed during our prediction time range in 2017 and the corresponding time in 2016:
foo <- holidays %>%
mutate(wday = wday(date, week_start = 1))
p1 <- foo %>%
ggplot(aes(holiday_flg, fill = holiday_flg)) +
geom_bar() +
theme(legend.position = "none")
p2 <- foo %>%
filter(date > ymd("2016-04-15") & date < ymd("2016-06-01")) %>%
ggplot(aes(date, holiday_flg, color = holiday_flg)) +
geom_point(size = 2) +
theme(legend.position = "none") +
labs(x = "2016 date")
p3 <- foo %>%
filter(date > ymd("2017-04-15") & date < ymd("2017-06-01")) %>%
ggplot(aes(date, holiday_flg, color = holiday_flg)) +
geom_point(size = 2) +
theme(legend.position = "none") +
labs(x = "2017 date")
layout <- matrix(c(1,1,2,3),2,2,byrow=FALSE)
multiplot(p1, p2, p3, layout=layout)
We find:
The same days were holidays in late Apr / May in 2016 as in 2017.
There are about 7% holidays in our data:
## frac
## 1 0.06769826
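The ~7% figure is just the mean of the binary holiday_flg column. A minimal sketch on a made-up date_info-style frame (ten days, two of them flagged):

```python
import pandas as pd

# Synthetic date_info-style frame: one row per calendar day, holiday_flg in {0, 1}
dates = pd.DataFrame({
    "calendar_date": pd.date_range("2016-01-01", periods=10, freq="D"),
    "holiday_flg": [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
})
# The holiday fraction is simply the mean of the 0/1 flag
frac = dates["holiday_flg"].mean()
print(frac)  # 0.2
```

Applied to the real date_info file, the same one-liner reproduces the 0.0677 above.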
# Convert 'date' column to datetime if it's not already
holidays['date'] = pd.to_datetime(holidays['date'])
# Plot 1
p1_data = holidays['holiday_flg'].value_counts().reset_index()
p1_data.columns = ['holiday_flg', 'count']
p2_data = holidays[(holidays['date'] > '2016-04-15') & (holidays['date'] < '2016-06-01')].copy()
p2_data['holiday_flg'] = p2_data['holiday_flg'].astype('category')  # categorical flag for plotting
p3_data = holidays[(holidays['date'] > '2017-04-15') & (holidays['date'] < '2017-06-01')]
# Multiplot layout
fig = plt.figure(figsize=(6, 4))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')
_ = locale.setlocale(locale.LC_ALL, 'en_US.utf8')
ax1 = fig.add_subplot(gs[:, 0])
ax2 = fig.add_subplot(gs[0, 1])
ax3 = fig.add_subplot(gs[1, 1])
# Plot1
p1= sns.barplot(data=p1_data, x='holiday_flg', y='count', palette='viridis',
ax=ax1, legend=False)
p1= ax1.set_xlabel('Holiday Flag')
#p1= ax1.set_title("Distribution of Holiday Flag")
p1= plt.legend("")
# Plot 2
p2= sns.scatterplot(data=p2_data, x='date', y='holiday_flg', hue='holiday_flg',
palette='viridis', ax=ax2, legend=False)
p2.set_yticks([1.0, 0.0], ["True","False"])
p2= ax2.set_xlabel('2016')
p2= ax2.set_title("2016")
#p2 = ax2.set(xticklabels = p2_data['date'].dt.strftime('%d %b'))
p2= fig.autofmt_xdate()
p2= ax2.set_ylabel('')
p2= plt.legend("")
p2= ax2.xaxis.set_major_locator(mdates.AutoDateLocator())
# Plot 3
p3= sns.scatterplot(data=p3_data, x='date', y='holiday_flg', hue='holiday_flg', palette='viridis', ax=ax3, legend=False)
p3.set_yticks([1.0, 0.0], ["True","False"])
p3 = ax3.set(xticklabels = p3_data['date'].dt.strftime('%d %b'))
p3= fig.autofmt_xdate()
p3= ax3.set_ylabel('')
p3= ax3.xaxis.set_major_locator(mdates.AutoDateLocator())
p3= plt.xlabel("2017")
#p3= plt.title("Holiday Flag in 2017")
p3= plt.legend("")
plt.tight_layout( pad=0.4, w_pad=0.5, h_pad=1)
plt.show()
The following plot visualises the time range of the train vs test data sets:
foo <- air_visits %>%
rename(date = visit_date) %>%
distinct(date) %>%
mutate(dset = "train")
bar <- test %>%
separate(id, c("foo", "bar", "date"), sep = "_") %>%
mutate(date = ymd(date)) %>%
distinct(date) %>%
mutate(dset = "test")
foo <- foo %>%
bind_rows(bar) %>%
mutate(year = year(date))
year(foo$date) <- 2017
foo %>%
filter(!is.na(date)) %>%
mutate(year = fct_relevel(as.factor(year), c("2017","2016"))) %>%
ggplot(aes(date, year, color = dset)) +
geom_point(shape = "|", size = 10) +
scale_x_date(date_labels = "%B", date_breaks = "1 month") +
#scale_y_reverse() +
theme(legend.position = "bottom", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
labs(color = "Data set") +
guides(color = guide_legend(override.aes = list(size = 4, pch = 15)))
# Renaming and selecting distinct dates for 'foo'
foo = air_visits.rename(columns={'visit_date': 'date'}).drop_duplicates(subset=['date'])
foo['dset'] = 'train'
# Processing 'bar'
test['date'] = pd.to_datetime(test['id'].str.split('_').str[-1])
bar = test[['date']].drop_duplicates()
bar['dset'] = 'test'
# Combining 'foo' and 'bar'
foo = pd.concat([foo, bar], ignore_index=True)
# Setting 'year' column
foo['year'] = foo['date'].dt.strftime('%Y').astype('category')
# Filtering out NA dates
foo = foo.dropna(subset=['date'])
# Adding a 'Months' column
foo['Months'] = foo['date'].dt.strftime('%B')
# Set up the plot
plt.figure(figsize=(8, 4))
#p= sns.set(rc={'figure.figsize':(4,2)})
plt.style.use('ggplot')
# Create a scatter plot using Seaborn
p= sns.scatterplot(data=foo, x='Months', y='year', hue='dset',
#palette={'train': 'blue', 'test': 'red'},
marker='*',
s=250)
# Format the tick labels to display the month name
#p.invert_yaxis()
p= plt.xticks(rotation=35)
p= plt.tick_params(axis='x', pad=-5, labelsize=8)
plt.margins(y=1)
plt.legend(bbox_to_anchor=(0.99, 0.01), loc='lower right', borderaxespad=0)
plt.show()
After looking at every data set individually, let’s get to the real fun and start combining them. This will tell us something about the relations between the various features and how these relations might affect the visitor numbers. Any signal we find will need to be interpreted in the context of the individual feature distributions, which is why studying those was one of our first steps.
Our first plot of the multi-feature space deals with the average number of air restaurant visitors broken down by type of cuisine; i.e. the air_genre_name. We use a facet plot to distinguish the time series for the 14 categories. Note the logarithmic y-axis:
foo <- air_visits %>%
left_join(air_store, by = "air_store_id")
foo %>%
group_by(visit_date, air_genre_name) %>%
summarise(mean_visitors = mean(visitors)) %>%
ungroup() %>%
ggplot(aes(visit_date, mean_visitors, color = air_genre_name)) +
geom_line() +
labs(y = "Average number of visitors to 'air' restaurants", x = "Date") +
theme(legend.position = "none") +
scale_y_log10() +
facet_wrap(~ air_genre_name)
We find:
The mean values range between 10 and 100 visitors per genre per
day. Within each category, the long-term trend looks reasonably stable.
There is an upward trend for Creative Cuisine and
Okonomiyaki et al., while the popularity of
Asian food has been declining since late 2016.
The low-count time series like Karaoke or
Asian (see Fig. 6) are understandably more noisy than the
genres with higher numbers of visitors. Still, Asian
restaurants appear to be very popular despite (or because of?) their
rarity.
In all of the curves we see the familiar weekly variation, so let’s look in more detail at the mean visitor numbers per week day and genre. We also add to this a series of ridgeline plots via the ggridges package. Ridgeline plots allow for a quick comparison of semi-overlapping (density) curves. Here we show the distribution of visitors per day for each genre. Through a little bit of ggplot magic, the y-axis labels in between those two plots refer to both of them:
p1 <- foo %>%
mutate(wday = wday(visit_date, label = TRUE, week_start = 1)) %>%
group_by(wday, air_genre_name) %>%
summarise(mean_visitors = mean(visitors)) %>%
ggplot(aes(air_genre_name, mean_visitors, color = wday)) +
geom_point(size = 4) +
theme(legend.position = "left", axis.text.y = element_blank(),
plot.title = element_text(size = 14)) +
coord_flip() +
labs(x = "") +
scale_x_discrete(position = "top") +
ggtitle("air_genre_name") +
scale_color_hue()
p2 <- foo %>%
ggplot(aes(visitors, air_genre_name, fill = air_genre_name)) +
geom_density_ridges(bandwidth = 0.1) +
scale_x_log10() +
theme(legend.position = "none") +
labs(y = "") +
scale_fill_cyclical(values = c("blue", "red"))
layout <- matrix(c(1,1,2,2,2),1,5,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Here each colour corresponds to a day of the week. Red-ish colours are the weekend, while the cooler colours are the middle of the week. Monday is a dark orange.
We find:
The biggest difference between weekend and weekdays exists for
the Karaoke bars, which rule the weekend. A similar trend,
although with a considerably smaller gap, can be seen for the
International cuisine.
No genre really goes against the trend of busier weekends. The
smallest variations are in the generic Other category, the
Japanese food, and also the Korean cuisine
which is the only category where Fridays are the busiest days. General
Bars/Cocktail are notably unpopular overall.
The density curves confirm the impression we got from the
week-day distribution: the Asian restaurants have rarely
less than 10 visitors per date and the Karaoke places show
a very broad distribution due to the strong impact of the weekends. Note
the logarithmic x-axis.
# Merge 'air_visits' and 'air_store' on the column 'air_store_id'
foo = pd.merge(air_visits, air_store, on='air_store_id', how='left')
#Grouping and summarizing data
summary_data = foo.groupby(['visit_date', 'air_genre_name'],
observed=True)['visitors'].mean().reset_index()
fig = plt.figure(figsize = (4,2))
plt.style.use('ggplot')
# Set up the FacetGrid
g = sns.FacetGrid(summary_data, col="air_genre_name",
col_wrap=4, height=2, aspect= 1,
sharex=True)
# Create line plots for each genre
p = g.map_dataframe(sns.lineplot, x='visit_date',
y='visitors', hue='air_genre_name')
# Resize facet titles
p= g.set_titles(col_template="{col_name}",
size = 8, backgroundcolor='#DDDDDD', pad=2
# bbox={'facecolor': '#DDDDDD', 'alpha': 0, 'pad': 0}
)
p= g.tick_params(axis='x', pad=0, labelsize= 6)
p= g.set_xlabels(fontsize=6, label="Date")
p= g.set_ylabels(fontsize=8 )#,label="Average number of visitors")
# Set labels and title
#g.set_axis_labels(x_var="Date",y_var= "Average number of visitors")
#g.add_legend(title="gendre")
for ax in g.axes.flat:
for label in ax.get_xticklabels():
label.set_rotation(45)
# Apply logarithmic scale to y-axis
p= g.set(yscale="log")
# Adjust the layout
p= g.tight_layout()
# Show the plot
plt.show()
# Extracting weekday
foo['wday'] = foo['visit_date'].dt.day_name()
# Data Processing for p1
p1_data = foo.groupby(['wday', 'air_genre_name'], observed=True).agg(mean_visitors=('visitors', 'mean')).reset_index()
# Data Processing for p2 (Assuming 'foo' DataFrame is available)
p2_data = foo.copy()
# Apply necessary transformations to 'p2_data' DataFrame
# Create a 2x2 grid
fig = plt.figure(figsize=(12, 8))
gs = GridSpec(2, 2, figure=fig)
# Plot 1 (Top Left)
ax1 = fig.add_subplot(gs[:, 0])
sns.scatterplot(data=p1_data,
y='air_genre_name',
x='mean_visitors',
hue='wday',
size=4,
#palette='viridis',
ax=ax1)
# Set the figure size using rcParams
ax1.set_title('air_genre_name', fontsize=8)
ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.legend(title='wday', loc='center right')
#ax1.legend("")
## Plot 2 (Top Right)
ax2 = fig.add_subplot(gs[:, 1])
# A single kdeplot call with `hue` draws one density curve per genre
sns.kdeplot(data=p2_data,
x='visitors',
hue='air_genre_name',
fill=True,
common_norm=False,
ax=ax2)
ax2.set_xlabel('')
ax2.set_ylabel('')
ax2.set_title('')
#ax2.legend("",loc=2)
p= ax2.set_xlim(-50, 200)
# Adjust layout
p= plt.tight_layout()
# Show the plots
plt.show()
# Data Processing for p2 (assuming the 'foo' DataFrame is available)
p2_data = foo.copy()
fig = plt.figure(figsize=(10, 8))
# Plot 2 (Bottom)
pal = sns.cubehelix_palette(10, rot=-.25, light=.7)
g = sns.FacetGrid(p2_data,
row="air_genre_name",
hue="air_genre_name", #palette=pal,
aspect=15, height=.5
)
p= g.map(sns.kdeplot, 'visitors',
bw_adjust=.5, clip_on=False,
fill=True, alpha=1, linewidth=.5)
p= g.map(sns.kdeplot, 'visitors', clip_on=False, color="black", lw=2, bw_adjust=.5)
# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
ax = plt.gca()
ax.text(0, .2, label, color=color, #fontweight="bold",
ha="right", va="center", transform=ax.transAxes)
p= g.map(label, "visitors")
p= g.set(xlim=(-50, 1000))
p= g.set_titles("")
p= g.set(yticks=[], ylabel="")
p= g.despine(bottom=True, left=True)
p= g.fig.subplots_adjust(hspace=-.25)
# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()
We will study the influence of holidays on our visitor numbers by comparing the statistics for days with a holiday flag vs those without:
foo <- air_visits %>%
mutate(calendar_date = as.character(visit_date)) %>%
left_join(holidays, by = "calendar_date")
p1 <- foo %>%
ggplot(aes(holiday_flg, visitors, color = holiday_flg)) +
geom_boxplot() +
scale_y_log10() +
theme(legend.position = "none")
p2 <- foo %>%
mutate(wday = wday(date, label = TRUE, week_start = 1)) %>%
group_by(wday, holiday_flg) %>%
summarise(mean_visitors = mean(visitors)) %>%
ggplot(aes(wday, mean_visitors, color = holiday_flg)) +
geom_point(size = 4) +
theme(legend.position = "none") +
labs(y = "Average number of visitors")
We find:
Overall, holidays don’t have any impact on the average visitor numbers (left panel). As so often, more information is hidden in the details.
While a weekend holiday has little impact on the visitor numbers, and even decreases them slightly, there is a much more pronounced effect for the weekdays; especially Monday and Tuesday (right panel).
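This weekday-vs-weekend interaction can be illustrated on a tiny synthetic table (visitor numbers are invented; in 2016, May 2 fell on a Monday and May 7 on a Saturday):

```python
import pandas as pd

# Synthetic visits: the flagged Monday gets a boost, the flagged Saturday does not
visits = pd.DataFrame({
    "visit_date": pd.to_datetime(["2016-05-02", "2016-05-09", "2016-05-07", "2016-05-14"]),
    "visitors": [30, 20, 25, 25],
    "holiday_flg": [1, 0, 1, 0],
})
visits["wday"] = visits["visit_date"].dt.day_name()
# Mean visitors per weekday, split by the holiday flag
pivot = visits.pivot_table(index="wday", columns="holiday_flg", values="visitors", aggfunc="mean")
print(pivot.loc["Monday", 1] - pivot.loc["Monday", 0])      # 10.0 -> holiday boost on Monday
print(pivot.loc["Saturday", 1] - pivot.loc["Saturday", 0])  # 0.0  -> no boost on Saturday
```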
from matplotlib import pyplot as plt, dates
# Data Processing
foo = air_visits.copy()
#foo['calendar_date'] = foo['visit_date'].dt.strftime('%Y-%m-%d')
foo['calendar_date'] = foo['visit_date'].astype(str)
#foo = foo.merge(holidays, left_on='calendar_date', right_on='date', how='left')
foo = pd.merge(foo, holidays, left_on='calendar_date', how='left', right_on = 'calendar_date')
# Data Processing for p1
p1_data = foo.copy()
p1_data['holiday_flg'] = p1_data['holiday_flg'].astype(bool) # Convert to boolean
p1_data['wday'] = p1_data['visit_date'].dt.day_name() # Extract weekday
# Data Processing for p2
# Data Processing for Plot 2
p2_data = foo.copy()
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
p2_data['wday'] = pd.Categorical(p2_data['visit_date'].dt.day_name(), categories= order, ordered=True)
p2_data['wdayn'] = p2_data['date'].dt.dayofweek # Extract weekday as a number (0-6)
p2_data = p2_data.groupby(['wday', 'holiday_flg', 'wdayn'], observed=True).agg(mean_visitors=('visitors', 'mean')).reset_index()
p2_data['wdayn'] = pd.Categorical(p2_data['wdayn'], categories=[0,1,2,3,4,5,6], ordered=True) # Sort weekdays
#p2_data['wdayn'] = p2_data['wdayn'].dt.day_name()
# Create a 1x2 grid
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
_ = locale.setlocale(locale.LC_ALL, 'en_US.utf8')
plt.style.use('ggplot')
# Plot 1 (p1)
flierprops = dict(marker='o', markerfacecolor='r', markersize=10,
linestyle='none', markeredgecolor='b')
sns.boxplot(data=p1_data, x='holiday_flg', y='visitors',
hue='holiday_flg',
#flierprops=flierprops,
fill=False,
#dodge=False,
showfliers = True,
ax=axes[0])
axes[0].set_yscale('log')
#axes[0].set_title('Boxplot of Visitors on Holidays')
#axes[0].set_xlabel('Holiday')
axes[0].set_ylabel('Visitors (log scale)')
axes[0].legend('')
# Plot 2 (p2)
sns.scatterplot(data=p2_data, x='wday', y='mean_visitors',
hue='holiday_flg',
ax=axes[1], s=50)
#axes[1].set_title('Average Visitors by Weekday')
axes[1].set_xlabel('wday')
axes[1].set_ylabel('Average Visitors')
axes[1].xaxis.set_ticks(order)
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45)
#axes[1].xaxis.set_major_formatter(dates.DateFormatter('%a'))
axes[1].legend('')
# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()
Our next exploration follows from a simple thought: if gastropubs are popular and we own the only gastropub in the area then we can expect lots of customers. If there are twelve other gastropubs in our street then, try as we might, some of those customers will venture into other establishments. Economists tell us that we can ultimately expect a convergence towards an equilibrium between supply and demand. But for snapshots like our data set, and for relatively localised areas, there might still be merit in investigating restaurant clustering. Therefore, let’s study the number of restaurants of a certain genre per area and their impact on visitor numbers.
We begin with an overview plot of the frequency of certain genres per area for the two data sets of air and hpg stores. This could have well been a part of the previous chapter, but I only just thought about it ;-) . The following count plots show which genres exist in which areas (names truncated). The size of the dots is proportional to the number of cases:
air_store %>%
mutate(area = str_sub(air_area_name, 1, 12)) %>%
ggplot(aes(area, air_genre_name)) +
geom_count(colour = "blue") +
theme(legend.position = "bottom", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9))
We find:
Some areas have lots of restaurants and much variety, whereas others contain only a single air restaurant. Large parts of the parameter space are empty.
Similarly, some cuisines like Izakaya or Cafe are pretty ubiquitous, whereas others can only be found in a few areas. Note that the only 2 Karaoke bars in the air sample are in Hokkaido Sapporo-shi Minami 3 Jonishi, whereas the only 2 International cuisine restaurants as well as the only two Asian places can be found in Tokyo-to Shibuya-ku Shibuya.
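A similar count plot can be sketched in Python with matplotlib, using dot sizes proportional to the counts (the area and genre names below are made up for illustration):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic stand-in for air_store with invented area and genre names
stores = pd.DataFrame({
    "air_area_name":  ["Tokyo A", "Tokyo A", "Tokyo A", "Fukuoka B", "Fukuoka B"],
    "air_genre_name": ["Izakaya", "Izakaya", "Cafe", "Izakaya", "Cafe"],
})
# One dot per (area, genre) pair, sized by the number of restaurants
counts = stores.groupby(["air_area_name", "air_genre_name"]).size().reset_index(name="n")
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(counts["air_area_name"], counts["air_genre_name"], s=counts["n"] * 100, color="blue")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

This mimics ggplot2’s geom_count, where dot area encodes the number of cases.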
The same kind of plot for the hpg data looks similar albeit more busy due to the larger number of genres:
hpg_store %>%
mutate(area = str_sub(hpg_area_name, 1, 10)) %>%
ggplot(aes(area, hpg_genre_name)) +
geom_count(colour = "red") +
theme(legend.position = "bottom", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9))
We find:
Here, too, there are busy areas and others with only a few restaurants. Unsurprisingly, Tokyo features prominently among the areas with a lot of culinary diversity.
Japanese style and
International cuisine are popular pretty much everywhere.
Amusement bars and Udon/Soba places are rare,
as are Shanghai food or Dim Sum.
The count plots tell us that there is a distribution of how many restaurants of a certain genre can be found per area. Here we look at these distributions in detail via boxplots with overlaid jitter plots. The genres are ordered by decreasing mean cases per area, i.e. the mean of a horizontal sequence of dots in a count plot. Then we overlay the individual data points and assign each dot a random jitter to visually separate otherwise overlapping data. Here, the y-axis (i.e. Occurrences per area) corresponds to the size of the dots in the count plots above. We’re using single plots here, instead of panels, because these plots are quite detailed. Note the logarithmic y-axes.
We start with the air data:
air_store %>%
group_by(air_genre_name, air_area_name) %>%
count() %>%
ggplot(aes(reorder(air_genre_name, n, FUN = mean), n)) +
geom_boxplot() +
geom_jitter(color = "blue") +
scale_y_log10() +
coord_flip() +
labs(x = "Air genre", y = "Occurrences per air area")
We find:
Only a few genres have medians of more than 2 restaurants per area. Examples are Italian/French restaurants or Bar/Cocktail places, which are more likely to be found in groups of more than 2 per area.
For the majority of genres the distribution is firmly clustered around 2 cases per area, with a bit of scatter towards higher numbers. Cafes have the highest number, with 26 occurrences in a single area (Fukuoka-ken Fukuoka-shi Daimyō).
Curiously, the minimum here is 2, not 1. This means that there is no air restaurant that is the only one of its genre in any area. You can see that also on the map in Fig. 5: whenever there is a group of 2 restaurants at the same area coordinates they have the same genre (but different air_store_ids). Similarly, there are no single occurrences of a genre in any larger group per area. (There exist odd numbers of occurrences, though, i.e. 3 or 5 of the same genre per area.) I’m not quite sure what to make of that, since it seems too perfectly matched to be true. Could it have something to do with the way the air data have been selected? Might it be a bug in assigning genre names? I checked a few examples of 2 restaurants in one spot and they have different visitor numbers on the same day, e.g. for this pair of Dining Bars, indicating that we don’t simply have a problem of single entries being duplicated here:
air_store %>%
filter(air_store_id == "air_b5598d12d1b84890" | air_store_id == "air_bbe1c1a47e09f161")
## air_store_id air_genre_name air_area_name latitude
## 1: air_bbe1c1a47e09f161 Dining bar Tōkyō-to Setagaya-ku Kitazawa 35.66266
## 2: air_b5598d12d1b84890 Dining bar Tōkyō-to Setagaya-ku Kitazawa 35.66266
## longitude
## 1: 139.6683
## 2: 139.6683
air_visits %>%
filter(air_store_id == "air_b5598d12d1b84890" | air_store_id == "air_bbe1c1a47e09f161") %>%
arrange(visit_date) %>%
head(10)
## air_store_id visit_date visitors
## 1: air_bbe1c1a47e09f161 2016-07-01 7
## 2: air_b5598d12d1b84890 2016-07-01 5
## 3: air_b5598d12d1b84890 2016-07-02 14
## 4: air_b5598d12d1b84890 2016-07-03 6
## 5: air_b5598d12d1b84890 2016-07-04 6
## 6: air_b5598d12d1b84890 2016-07-05 5
## 7: air_b5598d12d1b84890 2016-07-06 5
## 8: air_bbe1c1a47e09f161 2016-07-07 1
## 9: air_bbe1c1a47e09f161 2016-07-08 7
## 10: air_b5598d12d1b84890 2016-07-08 10
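The same duplication check can be sketched in pandas: pivot the visit histories of two co-located stores side by side and compare the columns (the store ids and visitor numbers below are made up):

```python
import pandas as pd

# Synthetic visit histories for two hypothetical stores sharing coordinates and genre
visits = pd.DataFrame({
    "air_store_id": ["air_x", "air_y", "air_x", "air_y"],
    "visit_date": pd.to_datetime(["2016-07-01", "2016-07-01", "2016-07-02", "2016-07-02"]),
    "visitors": [7, 5, 14, 6],
})
# Put the two series side by side; differing columns rule out plain duplication
wide = visits.pivot(index="visit_date", columns="air_store_id", values="visitors")
print(wide["air_x"].equals(wide["air_y"]))  # False -> the two stores are genuinely distinct
```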
Now we look at the same distribution for the HPG restaurants:
foobar <- hpg_store %>%
group_by(hpg_genre_name, hpg_area_name) %>%
count()
foobar %>%
ggplot(aes(reorder(hpg_genre_name, n, FUN = mean), n)) +
geom_boxplot() +
geom_jitter(color = "red") +
scale_y_log10() +
coord_flip() +
labs(x = "hpg genre", y = "Cases per hpg area")
We find:
Here we clearly have a minimum of 1 genre per area, and also much more variety in median cases due to the higher overall numbers.
The most extreme genre is Japanese style, for which the median is just above 10 restaurants per area. Alongside this, there are a number of other genres for which the lower box hinge is not touching the minimum of 1 case per area.
Using the information on the number of genres in each area we can now
proceed to quantify the clustering, or crowdedness, of our
data set and relate it to the visitor numbers. The next plot first shows
the overall distribution of the air and hpg data points from the last
two plots (i.e. cases of the same genre per area).
In addition, we estimate the mean visitor numbers for each clustering case. For this, we first take the mean of the log1p-transformed visitor numbers (per genre and area) and then compute the mean and standard deviation of these numbers for each case, i.e. number of occurrences of the same genre in an area. (log1p means log(x+1) and is intended to make the visitor number distribution more normal; see Fig. 1).
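The two-step aggregation described above can be sketched in pandas on toy numbers (all values below are invented; n stands for the number of same-genre stores in the area):

```python
import numpy as np
import pandas as pd

# Toy visit records: (genre, area) cells with raw visitor counts,
# plus a crowding level n = same-genre stores in that area
visits = pd.DataFrame({
    "genre":    ["A", "A", "A", "B", "B", "B"],
    "area":     ["x", "x", "y", "x", "y", "y"],
    "n":        [ 1,   1,   2,   1,   2,   2 ],
    "visitors": [10,  30,   5,   8,  12,  20],
})

# Step 1: mean of log1p-transformed visitors per (genre, area) cell
cells = (visits.assign(logv=np.log1p(visits["visitors"]))
               .groupby(["genre", "area", "n"])["logv"]
               .mean().reset_index(name="mean_log_visit"))

# Step 2: mean and sd of those cell means per crowding level n
# (sd is NaN for levels with a single cell, hence the fillna as in the R code)
summary = (cells.groupby("n")["mean_log_visit"]
                .agg(mean_mlv="mean", sd_mlv="std")
                .fillna({"sd_mlv": 0})
                .reset_index())
print(summary)
```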
foo <- air_visits %>%
left_join(air_store, by = "air_store_id")
bar <- air_store %>%
group_by(air_genre_name, air_area_name) %>%
count()
foobar <- hpg_store %>%
group_by(hpg_genre_name, hpg_area_name) %>%
count()
p1 <- bar %>%
ggplot(aes(n)) +
geom_histogram(fill = "blue", binwidth = 1) +
labs(x = "Air genres per area")
p2 <- foobar %>%
ggplot(aes(n)) +
geom_histogram(fill = "red", binwidth = 1) +
labs(x = "HPG genres per area")
p3 <- foo %>%
group_by(air_genre_name, air_area_name) %>%
summarise(mean_log_visit = mean(log1p(visitors))) %>%
left_join(bar, by = c("air_genre_name","air_area_name")) %>%
group_by(n) %>%
summarise(mean_mlv = mean(mean_log_visit),
sd_mlv = sd(mean_log_visit)) %>%
replace_na(list(sd_mlv = 0)) %>%
ggplot(aes(n, mean_mlv)) +
geom_point(color = "blue", size = 4) +
geom_errorbar(aes(ymin = mean_mlv - sd_mlv, ymax = mean_mlv + sd_mlv), width = 0.5, linewidth = 0.7, color = "blue") +
labs(x = "Cases of identical Air genres per area", y = "Mean +/- SD of\n mean log1p visitors")
## `summarise()` has grouped output by 'air_genre_name'. You can override using
## the `.groups` argument.
We find:
Upper panels: The numbers of cases of identical genres in the same area drop quickly, possibly exponentially, from the maxima of 2 and 1 for the air and hpg data sets, respectively. There is a longer tail towards larger case numbers, with a possible 2nd peak around 8 for air and about 13 for hpg.
Lower panel: The log1p means of visitor numbers show relatively large spread and are perfectly consistent in the range from 2 cases to 9. From 10 cases upward we simply don’t have the numbers to measure a spread: there are two data points with 11 cases per area and otherwise it’s just a single measurement. However, it is noteworthy that the scatter among all the data points > 9 cases is pretty low, and that they lie notably higher than the means of the points <= 9. Now, that could simply mean that those high visitor numbers are not being brought down by small (less busy?) areas, but that in itself is an interesting result.
Note that the scatter plot in the lower panel treats all the clustering/crowding cases together, regardless of genre. We can of course only draw this plot for the air data, for which we have visitor numbers.
# Create a new column 'area' by extracting the first 12 characters from 'air_area_name'
air_store['area'] = air_store['air_area_name'].str[:12]
#Count the frequency of each combination of 'area' and 'air_genre_name'
counts = air_store.groupby(['area', 'air_genre_name'], observed=True).size().reset_index(name='count')
# Plotting
plt.figure(figsize=(8, 4))
sns.scatterplot(data=counts, x='area', y='air_genre_name',
size='count', color='blue', legend=True)
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')
# Set labels and title
plt.xlabel('Area')
plt.ylabel('Air Genre Name')
plt.tick_params(axis='x', pad=0, labelsize=7)
# Adjust legend position
plt.legend(bbox_to_anchor = (1,1),loc='upper left')
# Show the plot
plt.show()
# Create a new column 'area' by extracting the first 10 characters from 'hpg_area_name'
hpg_store['area'] = hpg_store['hpg_area_name'].str[:10]
# Count the frequency of each combination of 'area' and 'hpg_genre_name'
counts = hpg_store.groupby(['area', 'hpg_genre_name'], observed=True).size().reset_index(name='count')
# Plotting
plt.figure(figsize=(8, 4))
sns.scatterplot(data=counts, x='area', y='hpg_genre_name',
size='count', color='red', legend=True)
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')
# Set labels and title
plt.xlabel('Area')
plt.ylabel('HPG Genre Name')
plt.tick_params(axis='x', pad=0, labelsize=7)
plt.tick_params(axis='y', pad=0, labelsize=7)
# Adjust legend position
plt.legend(bbox_to_anchor = (1,1),loc='upper left')
# Show the plot
plt.show()
# Group by 'air_genre_name' and 'air_area_name', and count the frequency
counts = air_store.groupby(['air_genre_name', 'air_area_name'], observed=True).size().reset_index(name='count')
# Reorder 'air_genre_name' by count
order = counts.groupby('air_genre_name', observed=True)['count'].mean().sort_values().index
# Plotting
plt.figure(figsize=(8, 6))
sns.boxplot(data=counts, y='air_genre_name', x='count', order=order, hue='air_genre_name' ,palette='viridis')
sns.stripplot(data=counts, y='air_genre_name', x='count', order=order, color='blue', size=3)
# Set labels and title
plt.xlabel('Occurrences per air area')
plt.ylabel('Air genre')
plt.xscale('log')
# Show the plot
plt.show()
selected_visits = air_visits[air_visits['air_store_id'].isin(["air_b5598d12d1b84890", "air_bbe1c1a47e09f161"])]
selected_visits.sort_values(by='visit_date').head(10)
## air_store_id visit_date visitors
## 239978 air_bbe1c1a47e09f161 2016-07-01 7
## 250021 air_b5598d12d1b84890 2016-07-01 5
## 250022 air_b5598d12d1b84890 2016-07-02 14
## 250023 air_b5598d12d1b84890 2016-07-03 6
## 250024 air_b5598d12d1b84890 2016-07-04 6
## 250025 air_b5598d12d1b84890 2016-07-05 5
## 250026 air_b5598d12d1b84890 2016-07-06 5
## 239979 air_bbe1c1a47e09f161 2016-07-07 1
## 250027 air_b5598d12d1b84890 2016-07-08 10
## 239980 air_bbe1c1a47e09f161 2016-07-08 7
foobar = hpg_store.groupby(['hpg_genre_name', 'hpg_area_name'], observed=True).size().reset_index(name='n')
# Reorder 'hpg_genre_name' by count
order = foobar.groupby('hpg_genre_name', observed=True)['n'].mean().sort_values(ascending=False).index
plt.figure(figsize=(8, 6))
# Create the boxplot with jitter
sns.boxplot(data=foobar, y='hpg_genre_name', x='n', order=order, color='lightgrey')
sns.stripplot(data=foobar, y='hpg_genre_name', x='n', order=order, color='red', alpha=0.7)
# Set x-axis to log scale (the counts are on the x-axis after the horizontal flip)
plt.xscale('log')
# Set labels and title
plt.xlabel('Cases per hpg area')
plt.ylabel('hpg genre')
plt.tick_params(axis='y', pad=0, labelsize=7)
# Show the plot
plt.show()
# Compute 'foo'
foo = air_visits.merge(air_store, on='air_store_id', how='left')
# Compute 'bar'
bar = air_store.groupby(['air_genre_name', 'air_area_name'], observed=True).size().reset_index(name='n')
# Compute 'foobar' (Assuming 'hpg_store' DataFrame is available)
foobar = hpg_store.groupby(['hpg_genre_name', 'hpg_area_name'], observed=True).size().reset_index(name='n')
p3_data = (
foo.groupby(['air_genre_name', 'air_area_name'], observed=True)
.agg(mean_log_visit=('visitors', lambda v: np.log1p(v).mean()))
.reset_index()
.merge(bar, left_on=['air_genre_name', 'air_area_name'], right_on=['air_genre_name', 'air_area_name'], how='left')
.groupby('n', observed=True)
.agg(
mean_mlv=('mean_log_visit', 'mean'),
sd_mlv=('mean_log_visit', 'std')
)
.fillna(0)
.reset_index()
)
# Multiplot layout
fig = plt.figure(figsize=(10, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')
# Assign each plot to a specific position
ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[0, 1])
ax3 = fig.add_subplot(gs[1, :])
# Plot p1 and p2 side by side
sns.histplot(data=bar, x='n', color='blue', binwidth=1, ax=ax1)
ax1.set(xlabel='Air Genres per area', ylabel='Count')
sns.histplot(data=foobar, x='n', color='red', binwidth=1, ax=ax2)
ax2.set(xlabel='HPG Genres per area', ylabel='Count')
# Plot p3
p3 = sns.scatterplot(data=p3_data, x='n', y='mean_mlv', color='blue', s=40, ax=ax3)
p3.errorbar(p3_data['n'], p3_data['mean_mlv'], yerr=p3_data['sd_mlv'], fmt='o', color='blue')
p3.set(xlabel='Cases of identical Air genres per area', ylabel='Mean +/- SD of\n mean log1p visitors')
# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()
We will try another method of quantifying the
distinctiveness of a restaurant by the distance to its
neighbouring restaurants in the feature engineering section.
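As a preview of that idea, a nearest-neighbour distance can be computed with the haversine formula. A minimal sketch (the function name and coordinates are invented for illustration, not part of the data set):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2)**2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Toy store coordinates: two Tokyo neighbours and one far-away store
lats = np.array([35.6627, 35.6630, 35.0116])
lons = np.array([139.6683, 139.6690, 135.7681])

# Pairwise distance matrix via broadcasting, then each store's nearest neighbour
d = haversine_km(lats[:, None], lons[:, None], lats[None, :], lons[None, :])
np.fill_diagonal(d, np.inf)
nearest = d.min(axis=1)
print(nearest.round(2))
```

A store whose nearest same-genre neighbour is hundreds of kilometres away is clearly in a very different competitive situation than one with a neighbour across the street.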
Next we will turn our attention to the reservation numbers in the air_reserve and hpg_reserve data sets. We have seen their time series and distributions back in Sections 4.2 and 4.3; now we will compare the reservation numbers to the actual visitor numbers.
For this, we compute the sum of
reserve_visitors per day (i.e. the number of people
reservations were made for) for each restaurant id and then join these
summaries to the air_visitors file. In order to include the
hpg reservations we need to use the store_ids data to join the
hpg_store_ids from the hpg_reserve file to the
corresponding air_store_ids:
foo <- air_reserve %>%
mutate(visit_date = date(visit_datetime)) %>%
group_by(air_store_id,visit_date) %>%
summarise(reserve_visitors_air = sum(reserve_visitors))
## `summarise()` has grouped output by 'air_store_id'. You can override using the
## `.groups` argument.
bar <- hpg_reserve %>%
mutate(visit_date = date(visit_datetime)) %>%
group_by(hpg_store_id,visit_date) %>%
summarise(reserve_visitors_hpg = sum(reserve_visitors)) %>%
inner_join(store_ids, by = "hpg_store_id")
## `summarise()` has grouped output by 'hpg_store_id'. You can override using the
## `.groups` argument.
all_reserve <- air_visits %>%
inner_join(foo, by = c("air_store_id", "visit_date")) %>%
inner_join(bar, by = c("air_store_id", "visit_date")) %>%
mutate(reserve_visitors = reserve_visitors_air + reserve_visitors_hpg)
Now we will plot the total reserve_visitor numbers
against the actual visitor numbers for the air restaurants. We use a
scatter plot to which we are adding marginal histograms via the
ggMarginal function of the ggExtra package. The
grey line shows reserve_visitors == visitors and the blue line is a linear fit:
p <- all_reserve %>%
filter(reserve_visitors < 120) %>%
ggplot(aes(reserve_visitors, visitors)) +
geom_point(color = "black", alpha = 0.5) +
geom_abline(slope = 1, intercept = 0, color = "grey60") +
geom_smooth(method = "lm", color = "blue")
ggMarginal(p, type="histogram", fill = "blue", bins=50)
## `geom_smooth()` using formula = 'y ~ x'
We find:
The histograms show that the reserve_visitors and visitors numbers peak below ~20 and are largely confined to the range below 100.
The scatter points fall largely above the line of identity, indicating that there were more visitors that day than had reserved a table. This is not surprising, since a certain number of people will always be walk-in customers.
A notable fraction of the points is below the line, which probably indicates that some people made a reservation but changed their mind and didn’t go. That kind of effect is probably to be expected and taking it into account will be one of the challenges in this competition.
The linear fit suggests a trend in which larger numbers of reserve_visitors are more likely to underestimate the eventual visitor numbers. This is not surprising either, since I can imagine that it is more likely that (a) a large reservation is cancelled than that (b) a large group of people walks into a restaurant without a reservation.
Now we will break down the discrepancy between visitors - reserve_visitors over time, look at the overall histograms, and visualise the air_reserve vs hpg_reserve numbers separately. Here, the time series for air (blue) and hpg (red) are offset vertically by 150 and -250 (see the solid black baselines):
p1 <- all_reserve %>%
ggplot(aes(visitors - reserve_visitors)) +
geom_histogram(binwidth = 5, fill = "black") +
coord_flip() +
labs(x = "")
p2 <- all_reserve %>%
ggplot(aes(visitors - reserve_visitors_air)) +
geom_histogram(binwidth = 5, fill = "blue") +
coord_flip() +
labs(x = "")
p3 <- all_reserve %>%
ggplot(aes(visitors - reserve_visitors_hpg)) +
geom_histogram(binwidth = 5, fill = "red") +
coord_flip() +
labs(x = "")
p4 <- all_reserve %>%
ggplot(aes(visit_date, visitors - reserve_visitors)) +
geom_hline(yintercept = c(150, 0, -250)) +
geom_line() +
geom_line(aes(visit_date, visitors - reserve_visitors_air + 150), color = "blue") +
geom_line(aes(visit_date, visitors - reserve_visitors_hpg - 250), color = "red") +
ggtitle("Visitors - Reserved: all (black), air (blue), hpg (red)")
layout <- matrix(c(4,4,2,4,4,1,4,4,3),3,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
We find:
The time series show significant scatter throughout the training time range. While the air (blue) and hpg (red) curves are predominantly above the baseline (i.e. more visitors than reservations), combining the two data sets brings the mean of the distribution closer to the zero line. This can also be seen in the corresponding histograms on the right side.
We see the gap where there are no air reservations (compare Sect. 4.2). We could only look at the hpg reservations here (for which this gap does not exist, Sect. 4.3) but it appears safe to assume that they would follow the same trend and can be used as a proxy for the air reservations. Feel free to check this assumption for the gap.
The (flipped) histograms in the 3 right panels are roughly aligned with the time series in the left panel for convenience of interpretation. They demonstrate how much the distributions are skewed towards larger visitor numbers than reserve_visitor numbers. We might see a mix here between two distributions: a (probably normal) spread due to cancellations plus a tail from walk-in visitors, which should follow a Poisson distribution.
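That mixture hypothesis is easy to probe with a quick simulation. In this hedged sketch (all rates and sizes are invented for illustration), cancellations are modelled as a binomial thinning of the reservations and walk-ins as Poisson arrivals:

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 10_000

# Invented parameters: ~25 reserved seats per day, a 10% no-show rate,
# and an average of 8 walk-in customers per day
reserved = rng.poisson(lam=25, size=n_days)
showed_up = rng.binomial(reserved, 0.9)     # cancellations thin the reservations
walk_ins = rng.poisson(lam=8, size=n_days)  # walk-ins add a Poisson component
visitors = showed_up + walk_ins

# Discrepancy per day; expectation is walk-ins minus no-shows: 8 - 0.1 * 25 = 5.5
diff = visitors - reserved
print(diff.mean())
```

The resulting diff distribution sits above zero with a spread from both components, qualitatively matching the skew seen in the histograms above.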
Finally, let’s have a quick look at the impact of holidays on the discrepancy between reservations and visitors. We’ll be using overlapping density plots:
all_reserve %>%
mutate(date = visit_date) %>%
left_join(holidays, by = "date") %>%
ggplot(aes(visitors - reserve_visitors, fill = holiday_flg)) +
geom_density(alpha = 0.5)
We find:
# Process 'air_reserve' data
foo = (
air_reserve.assign(visit_date=air_reserve['visit_datetime'].dt.date)
.groupby(['air_store_id', 'visit_date'], observed=True)
.agg(reserve_visitors_air=('reserve_visitors', 'sum'))
.reset_index()
)
# Process 'hpg_reserve' data
bar = (
hpg_reserve.assign(visit_date=hpg_reserve['visit_datetime'].dt.date)
.groupby(['hpg_store_id', 'visit_date'], observed=True)
.agg(reserve_visitors_hpg=('reserve_visitors', 'sum'))
.reset_index()
.merge(store_ids, left_on='hpg_store_id', right_on='hpg_store_id', how='inner')
)
air_visits['visit_date'] = air_visits['visit_date'].dt.date
# Combine reservation data with 'air_visits'
all_reserve = (
air_visits.merge(foo, left_on=['air_store_id', 'visit_date'], right_on=['air_store_id', 'visit_date'], how='inner')
.merge(bar, left_on=['air_store_id', 'visit_date'], right_on=['air_store_id', 'visit_date'], how='inner')
.assign(reserve_visitors=lambda x: x['reserve_visitors_air'] + x['reserve_visitors_hpg'])
)
# Filter the data
filtered_data = all_reserve[all_reserve['reserve_visitors'] < 120]
fig = plt.figure(figsize=(10, 6))
plt.style.use('ggplot')
# Create the scatterplot with marginal histograms
g = sns.jointplot(x='reserve_visitors', y='visitors', data=filtered_data, color='black', alpha=0.2, kind='scatter')
# Add the identity line visitors == reserve_visitors
g.ax_joint.plot([0, 120], [0, 120], color='grey')
# Add a linear regression line
sns.regplot(x='reserve_visitors', y='visitors', data=filtered_data, ax=g.ax_joint, line_kws={'color': 'blue'})
# Add marginal histograms
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Scatterplot with Marginal Histograms')
plt.show()
# Sort DataFrame by visit_date
all_reserve = all_reserve.sort_values(by='visit_date')
# Plot 1
plt.figure(figsize=(9, 6))
gs = GridSpec(3, 3)
plt.style.use('ggplot')
ax1 = plt.subplot(gs[0, 2])
ax1.hist(all_reserve['visitors'] - all_reserve['reserve_visitors'],
bins=range(-50, 60, 5), color='black', orientation='horizontal')
ax1.set_xlabel("")
ax1.invert_yaxis()
# Set size and rotation for x-axis tick labels
ax1.tick_params(axis='x', labelsize=8, rotation=0)
ax1.tick_params(axis='y', labelsize=8, rotation=0)
# Plot 2
ax2 = plt.subplot(gs[1, 2])
ax2.hist(all_reserve['visitors'] - all_reserve['reserve_visitors_air'],
bins=range(-50, 60, 5), color='blue', orientation='horizontal')
ax2.set_xlabel("")
ax2.invert_yaxis()
# Set size and rotation for x-axis tick labels
ax2.tick_params(axis='x', labelsize=8, rotation=0)
ax2.tick_params(axis='y', labelsize=8, rotation=0)
# Plot 3
ax3 = plt.subplot(gs[2, 2])
ax3.hist(all_reserve['visitors'] - all_reserve['reserve_visitors_hpg'],
bins=range(-50, 60, 5), color='red', orientation='horizontal')
ax3.set_xlabel("")
ax3.invert_yaxis()
# Set size and rotation for x-axis tick labels
ax3.tick_params(axis='x', labelsize=8, rotation=0)
ax3.tick_params(axis='y', labelsize=8, rotation=0)
ax3.set(xlabel='count') # ylabel='Count'
# Plot 4
ax4 = plt.subplot(gs[0:3, 0:2])
# Draw the baselines at +150 (air), 0 (all), and -250 (hpg)
for y in (150, 0, -250):
    ax4.axhline(y=y, color='black', linewidth=0.8)
ax4.plot(all_reserve['visit_date'],
all_reserve['visitors'] - all_reserve['reserve_visitors'], color='black')
ax4.plot(all_reserve['visit_date'],
all_reserve['visitors'] - all_reserve['reserve_visitors_air'] + 150, color='blue')
ax4.plot(all_reserve['visit_date'],
all_reserve['visitors'] - all_reserve['reserve_visitors_hpg'] - 250, color='red')
ax4.set_title("Visitors - Reserved: all (black), air (blue), hpg (red)", fontsize=10)
# Format the x-axis date labels
ax4.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
ax4.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
# Set size and rotation for x-axis tick labels
ax4.tick_params(axis='x', labelsize=8, rotation=0)
ax4.set(xlabel='visit_date', ylabel= "visitors - reserve_visitors")
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()
# Convert visit_date to datetime if needed
all_reserve['visit_date'] = pd.to_datetime(all_reserve['visit_date'])
# Rename 'visit_date' column before merging
all_reserve['date'] = all_reserve['visit_date']
# Left join with holidays
reserve_holy = all_reserve.merge(holidays, on='date', how='left')
# Create the density plot
#sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
plt.style.use('ggplot')
sns.histplot(data=reserve_holy, x=(reserve_holy['visitors'] - reserve_holy['reserve_visitors']),
             stat='density', kde=True,
             hue='holiday_flg', fill=True, alpha=0.5, palette='Set1')
plt.title("Density Plot of Visitors - Reserve Visitors")
plt.xlabel("Visitors - Reserve Visitors")
plt.ylabel("Density")
#plt.legend(loc='lower left')
plt.show()
The next step is to derive new features from the existing ones. The purpose of these new features is to provide additional predictive power for our goal of forecasting visitor numbers. They can be as simple as the day of the week or the month derived from a date column; something we have already used in Fig. 1 of our visual exploration. Or they can take a more complex shape from the interplay of several related variables. This section collects and studies these new features.
My personal preference is to collect all engineered features into a single code block, so that we don’t have to search for them in different places of the kernel. Often, this also makes the computation of these features easier as we can re-use certain intermediate transformations. As we expand our analysis, we will come back to this code block and it will grow as our feature space grows:
air_visits <- air_visits %>%
mutate(wday = wday(visit_date, label=TRUE, week_start = 1),
wday = fct_relevel(wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
month = month(visit_date, label=TRUE))
air_reserve <- air_reserve %>%
mutate(reserve_date = date(reserve_datetime),
reserve_hour = hour(reserve_datetime),
reserve_wday = wday(reserve_datetime, label = TRUE, week_start = 1),
reserve_wday = fct_relevel(reserve_wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
visit_date = date(visit_datetime),
visit_hour = hour(visit_datetime),
visit_wday = wday(visit_datetime, label = TRUE, week_start = 1),
visit_wday = fct_relevel(visit_wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
diff_hour = time_length(visit_datetime - reserve_datetime, unit = "hour"),
diff_day = time_length(visit_datetime - reserve_datetime, unit = "day"))
hpg_reserve <- hpg_reserve %>%
mutate(reserve_date = date(reserve_datetime),
reserve_hour = hour(reserve_datetime),
reserve_wday = wday(reserve_datetime, label = TRUE, week_start = 1),
reserve_wday = fct_relevel(reserve_wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
visit_date = date(visit_datetime),
visit_hour = hour(visit_datetime),
visit_wday = wday(visit_datetime, label = TRUE, week_start = 1),
visit_wday = fct_relevel(visit_wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
diff_hour = time_length(visit_datetime - reserve_datetime, unit = "hour"),
diff_day = time_length(visit_datetime - reserve_datetime, unit = "day"))
# count stores in area
air_count <- air_store %>%
group_by(air_area_name) %>%
summarise(air_count = n())
hpg_count <- hpg_store %>%
group_by(hpg_area_name) %>%
summarise(hpg_count = n())
# distances
med_coord_air <- air_store %>%
summarise_at(vars(longitude:latitude), median)
med_coord_hpg <- hpg_store %>%
summarise_at(vars(longitude:latitude), median)
air_coords <- air_store %>%
select(longitude, latitude)
hpg_coords <- hpg_store %>%
select(longitude, latitude)
air_store$dist <- distCosine(air_coords, med_coord_air)/1e3
hpg_store$dist <- distCosine(hpg_coords, med_coord_hpg)/1e3
# apply counts, dist; add prefecture
air_store <- air_store %>%
mutate(dist_group = as.integer(case_when(
dist < 80 ~ 1,
dist < 300 ~ 2,
dist < 500 ~ 3,
dist < 750 ~ 4,
TRUE ~ 5))) %>%
left_join(air_count, by = "air_area_name") %>%
separate(air_area_name, c("prefecture"), sep = " ", remove = FALSE)
## Warning: Expected 1 pieces. Additional pieces discarded in 829 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
hpg_store <- hpg_store %>%
mutate(dist_group = as.integer(case_when(
dist < 80 ~ 1,
dist < 300 ~ 2,
dist < 500 ~ 3,
dist < 750 ~ 4,
TRUE ~ 5))) %>%
left_join(hpg_count, by = "hpg_area_name") %>%
separate(hpg_area_name, c("prefecture"), sep = " ", remove = FALSE)
## Warning: Expected 1 pieces. Additional pieces discarded in 4690 rows [1, 2, 3, 4, 5, 6,
## 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
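As an aside, the case_when distance binning above maps naturally onto pandas.cut. A minimal sketch with the same thresholds (the toy distances are invented; right=False makes the bins half-open to match the < comparisons):

```python
import numpy as np
import pandas as pd

# Toy distances in km, one per dist_group bin
dist = pd.Series([12.0, 150.0, 420.0, 600.0, 900.0])

# Same thresholds as the case_when ladder: <80, <300, <500, <750, else
dist_group = pd.cut(dist, bins=[-np.inf, 80, 300, 500, 750, np.inf],
                    labels=[1, 2, 3, 4, 5], right=False).astype(int)
print(dist_group.tolist())
```

This is vectorised, so it avoids the element-wise apply used in the translation below.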
from pandas.api.types import CategoricalDtype
# Convert visit_date to datetime
air_visits['visit_date'] = pd.to_datetime(air_visits['visit_date'])
# Extract weekday
air_visits['wday'] = air_visits['visit_date'].dt.day_name()
# Reorder weekdays
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
air_visits['wday'] = air_visits['wday'].astype(CategoricalDtype(categories=weekday_order, ordered=True))
# Extract month
air_visits['month'] = air_visits['visit_date'].dt.month_name()
# Order months chronologically so that later groupbys and tick labels line up
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
air_visits['month'] = air_visits['month'].astype(CategoricalDtype(categories=month_order, ordered=True))
# Convert visit_datetime to datetime
air_reserve['reserve_datetime'] = pd.to_datetime(air_reserve['reserve_datetime'])
air_reserve['visit_datetime'] = pd.to_datetime(air_reserve['visit_datetime'])
# Extract various date and time components
air_reserve['reserve_date'] = air_reserve['reserve_datetime'].dt.date
air_reserve['reserve_hour'] = air_reserve['reserve_datetime'].dt.hour
air_reserve['reserve_wday'] = air_reserve['reserve_datetime'].dt.day_name()
air_reserve['reserve_wday'] = air_reserve['reserve_wday'].astype(CategoricalDtype(categories=weekday_order, ordered=True))
air_reserve['visit_date'] = air_reserve['visit_datetime'].dt.date
air_reserve['visit_hour'] = air_reserve['visit_datetime'].dt.hour
air_reserve['visit_wday'] = air_reserve['visit_datetime'].dt.day_name()
air_reserve['visit_wday'] = air_reserve['visit_wday'].astype(CategoricalDtype(categories=weekday_order, ordered=True))
# Calculate time differences
air_reserve['diff_hour'] = (air_reserve['visit_datetime'] - air_reserve['reserve_datetime']).dt.total_seconds() / 3600
air_reserve['diff_day'] = (air_reserve['visit_datetime'] - air_reserve['reserve_datetime']).dt.total_seconds() / 86400
# Count stores in area
air_count = air_store.groupby('air_area_name', observed=True).size().reset_index(name='air_count')
hpg_count = hpg_store.groupby('hpg_area_name', observed=True).size().reset_index(name='hpg_count')
# Calculate distances
med_coord_air = air_store[['longitude', 'latitude']].median()
med_coord_hpg = hpg_store[['longitude', 'latitude']].median()
air_coords = air_store[['longitude', 'latitude']]
hpg_coords = hpg_store[['longitude', 'latitude']]
# Great-circle distance (km) to the median point; a Euclidean norm on raw degrees
# would not give distances in km (the R code uses geosphere::distCosine)
def haversine_km(lon, lat, lon0, lat0):
    lon, lat, lon0, lat0 = map(np.radians, (lon, lat, lon0, lat0))
    a = np.sin((lat - lat0) / 2)**2 + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))
air_store['dist'] = haversine_km(air_store['longitude'], air_store['latitude'],
                                 med_coord_air['longitude'], med_coord_air['latitude'])
hpg_store['dist'] = haversine_km(hpg_store['longitude'], hpg_store['latitude'],
                                 med_coord_hpg['longitude'], med_coord_hpg['latitude'])
# Apply counts, dist; add prefecture
def dist_group(x):
if x < 80:
return 1
elif x < 300:
return 2
elif x < 500:
return 3
elif x < 750:
return 4
else:
return 5
air_store['dist_group'] = air_store['dist'].apply(dist_group)
air_store = pd.merge(air_store, air_count, left_on='air_area_name', right_on='air_area_name', how='left')
# Split 'air_area_name' into 'prefecture' and 'area'
split_data = air_store['air_area_name'].str.split(' ', n=1, expand=True)
# Assign the first part to 'prefecture' and the rest to 'area'
air_store['prefecture'] = split_data[0]
air_store['area'] = split_data[1]
# If there was only one part (no space), assign an empty string to 'area'
air_store['area'].fillna("", inplace=True)
hpg_store['dist_group'] = hpg_store['dist'].apply(dist_group)
hpg_store = pd.merge(hpg_store, hpg_count, left_on='hpg_area_name', right_on='hpg_area_name', how='left')
# Split 'hpg_area_name' into 'prefecture' and 'area'
split_data = hpg_store['hpg_area_name'].str.split(' ', n=1, expand=True)
# Assign the first part to 'prefecture' and the rest to 'area'
hpg_store['prefecture'] = split_data[0]
hpg_store['area'] = split_data[1]
# If there was only one part (no space), assign an empty string to 'area'
hpg_store['area'].fillna("", inplace=True)
We start simple, with the week day (wday) and month features derived from the visit_date. We already looked at the total visitors per week day and month in Fig. 1. Now we will study the log1p transformed visitor numbers. We compute their mean and standard deviation and plot them alongside the (ridgeline) density distributions:
p1 <- air_visits %>%
group_by(wday) %>%
summarise(mean_log_visitors = mean(log1p(visitors)),
sd_log_visitors = sd(log1p(visitors))) %>%
ggplot(aes(wday, mean_log_visitors, color = wday)) +
geom_point(size = 4) +
geom_errorbar(aes(ymin = mean_log_visitors - sd_log_visitors,
ymax = mean_log_visitors + sd_log_visitors,
color = wday), width = 0.5, linewidth = 0.7) +
theme(legend.position = "none")
p2 <- air_visits %>%
mutate(visitors = log1p(visitors)) %>%
ggplot(aes(visitors, wday, fill = wday)) +
geom_density_ridges(bandwidth = 0.1) +
scale_x_log10() +
theme(legend.position = "none") +
labs(x = "log1p(visitors)", y = "")
p3 <- air_visits %>%
group_by(month) %>%
summarise(mean_log_visitors = mean(log1p(visitors)),
sd_log_visitors = sd(log1p(visitors))) %>%
ggplot(aes(month, mean_log_visitors, color = month)) +
geom_point(size = 4) +
geom_errorbar(aes(ymin = mean_log_visitors - sd_log_visitors,
ymax = mean_log_visitors + sd_log_visitors,
color = month), width = 0.5, linewidth = 0.7) +
theme(legend.position = "none")
p4 <- air_visits %>%
mutate(visitors = log1p(visitors)) %>%
ggplot(aes(visitors, month, fill = month)) +
geom_density_ridges(bandwidth = 0.1) +
scale_x_log10() +
theme(legend.position = "none") +
labs(x = "log1p(visitors)", y = "")
layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
We find:
The differences in mean log(visitor) numbers between the days of the week are much smaller than the corresponding uncertainties. This is confirmed by the density plots that show a strong overlap. Still, there is a trend toward higher visitor numbers on the weekend that might be useful.
The months are even more similar, including the slightly increased numbers in December. Overall, there is not much difference.
# Compute log1p of visitors
air_visits['log1p_visitors'] = np.log1p(air_visits['visitors'])
# Plot 1 data
p1_data = air_visits.groupby('wday', observed=True).agg(mean_log_visitors=('log1p_visitors', 'mean'),
sd_log_visitors=('log1p_visitors', 'std')).reset_index()
p3_data = air_visits.groupby('month', observed=True).agg(mean_log_visitors=('log1p_visitors', 'mean'),
sd_log_visitors=('log1p_visitors', 'std')).reset_index()
# Create a 1x2 grid for the two error-bar panels (p1 and p3)
fig, axes = plt.subplots(1, 2, figsize=(10, 6))
plt.style.use('ggplot')
# Define custom order for weekdays and months
custom_wday_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
custom_month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Plot 1
p1 = sns.scatterplot(data=p1_data, x='wday', y='mean_log_visitors',
hue='wday', #palette='viridis',
ax=axes[0])
p1.errorbar(p1_data['wday'], p1_data['mean_log_visitors'],
yerr=p1_data['sd_log_visitors'], fmt='o',
color='black',
markersize=5,capsize=8)
# Set custom order for x-axis labels
p1.set_xticks(range(len(custom_wday_order)))
p1.set_xticklabels(custom_wday_order)
p1.get_legend().remove()
# Plot 3
p3 = sns.scatterplot(data=p3_data, x='month', y='mean_log_visitors',
hue='month', #palette='viridis',
ax=axes[1])
p3.errorbar(p3_data['month'], p3_data['mean_log_visitors'],
yerr=p3_data['sd_log_visitors'],
fmt='o',
color='black',
#palette='viridis',
markersize=5,capsize=8)
# Set custom order for x-axis labels
p3.set_xticks(range(len(custom_month_order)))
p3.set_xticklabels(custom_month_order, rotation=45)
p3.get_legend().remove()
# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()
# Plot 2
p2_data = air_visits.copy()
p2_data['log1p_visitors'] = p2_data['visitors'].apply(lambda x: np.log1p(x))
fig = plt.figure(figsize=(12, 12))
g = sns.FacetGrid(p2_data,
row='wday',
hue="wday", palette='viridis',
aspect=15, height=.5
)
p2= g.map(sns.kdeplot, 'log1p_visitors',
bw_adjust=.5, clip_on=False,
fill=True, alpha=1, linewidth=.5)
p2= g.map(sns.kdeplot, 'log1p_visitors', clip_on=False, color="black", lw=2, bw_adjust=.5)
# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
ax = plt.gca()
ax.text(0, .2, label, color=color, #fontweight="bold",
ha="right", va="center", transform=ax.transAxes)
p2= g.map(label, 'log1p_visitors')
p2= g.set(xlim=(0, 10))
p2= g.set_titles("")
p2= g.set(yticks=[], ylabel="")
p2= g.despine(bottom=True, left=True)
p2= g.fig.subplots_adjust(hspace=-.25)
# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()
# Plot 4
p4_data = air_visits.copy()
p4_data['log1p_visitors'] = p4_data['visitors'].apply(lambda x: np.log1p(x))
fig = plt.figure(figsize=(20, 20))
g = sns.FacetGrid(p4_data,
row='month',
hue="month", palette='viridis',
aspect=15, height=.5
)
p4= g.map(sns.kdeplot, 'log1p_visitors',
bw_adjust=.5, clip_on=False,
fill=True, alpha=1, linewidth=.5)
p4= g.map(sns.kdeplot, 'log1p_visitors', clip_on=False, color="black", lw=2, bw_adjust=.5)
# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
ax = plt.gca()
ax.text(0, .2, label, color=color, #fontweight="bold",
ha="right", va="center", transform=ax.transAxes)
p4= g.map(label, 'log1p_visitors')
p4= g.set(xlim=(0, 10))
p4= g.set_titles("")
p4= g.set(yticks=[], ylabel="")
p4= g.despine(bottom=True, left=True)
p4= g.fig.subplots_adjust(hspace=-.25)
# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Define a function to compute log1p of visitors
def compute_log1p_visitors(visitors):
    return np.log1p(visitors)
# Note: a FacetGrid always creates its own figure, so the two ridgeline plots
# cannot share one plt.subplots() grid; we draw them as separate figures instead.
# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, .2, label, color=color, ha="right", va="center", transform=ax.transAxes)
# Plot 2
p2_data = air_visits.copy()
p2_data['log1p_visitors'] = compute_log1p_visitors(p2_data['visitors'])
p2 = sns.FacetGrid(p2_data, row='wday', hue="wday", palette='viridis',
                   aspect=15, height=.5)
p2.map(sns.kdeplot, 'log1p_visitors', bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=.5)
p2.map(label, 'log1p_visitors')
p2.fig.subplots_adjust(hspace=-.25)
# Plot 4
p4_data = air_visits.copy()
p4_data['log1p_visitors'] = compute_log1p_visitors(p4_data['visitors'])
p4 = sns.FacetGrid(p4_data, row='month', hue="month", palette='viridis',
                   aspect=15, height=.5)
p4.map(sns.kdeplot, 'log1p_visitors', bw_adjust=.5, clip_on=False, fill=True, alpha=1, linewidth=.5)
p4.map(label, 'log1p_visitors')
p4.fig.subplots_adjust(hspace=-.25)
Here is a breakdown by air_genre_name:
air_visits %>%
left_join(air_store, by = "air_store_id") %>%
group_by(wday, air_genre_name) %>%
summarise(mean_log_visitors = mean(log1p(visitors)),
sd_log_visitors = sd(log1p(visitors))) %>%
ggplot(aes(wday, mean_log_visitors, color = wday)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_log_visitors - sd_log_visitors,
ymax = mean_log_visitors + sd_log_visitors,
color = wday), width = 0.5, size = 0.7) +
theme(legend.position = "none", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
facet_wrap(~ air_genre_name)
## `summarise()` has grouped output by 'wday'. You can override using the
## `.groups` argument.
We find:
There is a stronger impact of the weekend vs weekdays for genres such as “Karaoke” or “International Cuisine”, while others show very little change over the days of the week (e.g. “Japanese food”).
Overall, within the relatively large uncertainties, none of these day-of-the-week differences are statistically significant for any genre.
What about the statistics of visitors vs reservations? Does it depend on the day of the week? We use boxplots to find out:
all_reserve %>%
mutate(wday = wday(visit_date, label=TRUE, week_start = 1),
wday = fct_relevel(wday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) %>%
ggplot(aes(wday, visitors - reserve_visitors, fill = wday)) +
geom_boxplot() +
theme(legend.position = "none")
We find that there is no significant difference, although we see a slight trend for more visitors compared to reservations on the weekend, especially Sunday.
# Compute log1p of visitors
air_visits['log1p_visitors'] = np.log1p(air_visits['visitors'])
# Merge air_visits and air_store
merged_data = pd.merge(air_visits, air_store, on='air_store_id', how='left')
# Group by wday and air_genre_name, compute mean and standard deviation
summary_data = merged_data.groupby(['wday', 'air_genre_name'], observed=True).agg(
mean_log_visitors=('log1p_visitors', 'mean'),
sd_log_visitors=('log1p_visitors', 'std')
).reset_index()
fig = plt.figure(figsize = (12,8))
plt.style.use('ggplot')
g = sns.FacetGrid(summary_data, col="air_genre_name", col_wrap=4,
sharex=True, sharey=True, height=5)
g.map_dataframe(sns.scatterplot, x='wday', y='mean_log_visitors',
hue='wday', palette='viridis', s=50, alpha=0.7)# Add error bars
for ax, (_, data) in zip(g.axes.flat, summary_data.groupby('air_genre_name')):
#sns.lineplot(data=data, x='wday', y='mean_log_visitors', ax=ax, color='black')
ax.errorbar(data['wday'], data['mean_log_visitors'], yerr=data['sd_log_visitors'], fmt='none', color='black', capsize=5)
ax.legend_ = None  # remove the per-facet legend
# Resize facet titles
p= g.set_titles(col_template="{col_name}",
size = 8, backgroundcolor='#DDDDDD', pad=2
# bbox={'facecolor': '#DDDDDD', 'alpha': 0, 'pad': 0}
)
p= g.tick_params(axis='x', pad=0, labelsize= 6)
p= g.set_xlabels(fontsize=6, label="Date")
p= g.set_ylabels(fontsize=8 )#,label="Average number of visitors")
for ax in g.axes.flat:
for label in ax.get_xticklabels():
label.set_rotation(45)
#ax.xaxis.set_major_formatter(dates.DateFormatter('%a'))
# Apply logarithmic scale to y-axis
#p= g.set(yscale="log")
# Set x-axis labels
p= g.set_axis_labels("Weekday", "Mean Log Visitors")
#p= g.add_legend(title="weekday")
# Adjust layout
p= plt.tight_layout()
# Show the plot
plt.show()
# Convert visit_date to datetime and extract the weekday
all_reserve['visit_date'] = pd.to_datetime(all_reserve['visit_date'])
all_reserve['wday'] = all_reserve['visit_date'].dt.day_name()
# Reorder the weekdays
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
all_reserve['wday'] = pd.Categorical(all_reserve['wday'], categories=weekday_order, ordered=True)
#Create a new column for the difference between visitors and reserve_visitors
all_reserve['visitors_minus_reserve'] = all_reserve['visitors'] - all_reserve['reserve_visitors']
# Create the boxplot
plt.figure(figsize=(10, 6))
plt.style.use('ggplot')
sns.boxplot(data=all_reserve, x='wday', y='visitors_minus_reserve',
hue='wday', palette='viridis')
plt.xlabel('Weekday')
plt.ylabel('Visitors - Reserve Visitors')
plt.tick_params(axis='x', pad=0, labelsize= 8)
#plt.set_major_formatter(dates.DateFormatter('%a'))
#plt.title('Difference between Visitors and Reserve Visitors by Weekday')
# Remove legend
#plt.legend().remove()
# Show the plot
plt.show()
This feature will address the question of how many restaurants can be found in a certain area and whether larger numbers of restaurants per area affect the average visitor numbers.
In order to compute it, we simply count the number of air/hpg restaurants for the corresponding areas. We show the resulting distributions at the top of the next plot. Then we estimate the mean log1p-transformed visitor count per restaurant, and compute the mean and standard deviation of this number for each area count group:
p1 <- air_count %>%
ggplot(aes(air_count)) +
geom_histogram(binwidth = 2, fill = "blue")
p2 <- hpg_count %>%
ggplot(aes(hpg_count)) +
geom_histogram(binwidth = 5, fill = "red")
p3 <- air_visits %>%
left_join(air_store, by = "air_store_id") %>%
group_by(air_store_id, air_count) %>%
summarise(mean_store_visit = mean(log1p(visitors))) %>%
group_by(air_count) %>%
summarise(mean_log_visitors = mean(mean_store_visit),
sd_log_visitors = sd(mean_store_visit)) %>%
ggplot(aes(air_count, mean_log_visitors)) +
geom_point(size = 4, color = "blue") +
geom_errorbar(aes(ymin = mean_log_visitors - sd_log_visitors,
ymax = mean_log_visitors + sd_log_visitors),
color = "blue", width = 0.5, size = 0.7) +
geom_smooth(method = "lm", color = "black") +
labs(x = "Air restaurants per area")
## `summarise()` has grouped output by 'air_store_id'. You can override using the
## `.groups` argument.
## `geom_smooth()` using formula = 'y ~ x'
We find:
Most areas in the air data set have only a few restaurants, with the distribution peaking at 2. This goes back to the earlier observation that no single air_genre can ever be found alone in an air_area; there are always at least 2 of the same type. Still odd. The air data also has a tentative 2nd peak around 16-20. The hpg data, with larger overall numbers, also peaks at low counts and shows a few other, smaller peaks along its decline.
The mean log1p visitor numbers per area have large uncertainties and are all consistent with each other. There is a slight trend, shown by the black linear regression, toward lower average numbers for larger areas but it is quite weak.
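The `air_count` / `hpg_count` features themselves are constructed in an earlier part of the kernel. A minimal pandas sketch of the idea, on a hypothetical miniature store table (the column names follow the data description; the rows are made up):

```python
import pandas as pd

# Hypothetical miniature store table (stand-in for the real air_store file)
demo_stores = pd.DataFrame({
    'air_store_id': ['a1', 'a2', 'a3', 'a4', 'a5'],
    'air_area_name': ['Tokyo A', 'Tokyo A', 'Osaka B', 'Osaka B', 'Osaka B'],
})

# Number of restaurants in each store's area, attached back to every row
demo_stores['air_count'] = (demo_stores
                            .groupby('air_area_name')['air_store_id']
                            .transform('count'))
```

Using `transform('count')` rather than a plain aggregation keeps one row per restaurant, so the count can be joined onto `air_visits` directly via `air_store_id`.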
# Multiplot layout
fig = plt.figure(figsize=(9, 5))
gs = GridSpec(2, 2, figure=fig)
plt.style.use('ggplot')
ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[0, 1])
ax3 = fig.add_subplot(gs[1, :])
# Plot 1
p1 = sns.histplot(data=air_count, x='air_count',
bins=range(0, max(air_count['air_count'])+2, 2),
color='blue',
ax=ax1
)
p1.set_xlabel('Air Count')
p1.set_title('Air Count Histogram')
# Plot 2
p2 = sns.histplot(data=hpg_count, x='hpg_count',
bins=range(0, max(hpg_count['hpg_count'])+6, 5),
color='red',
ax=ax2)
p2.set_xlabel('HPG Count')
p2.set_title('HPG Count Histogram')
# Preprocessing for Plot 3
air_visits_processed = air_visits.merge(air_store, on='air_store_id', how='left')
store_visits = air_visits_processed.groupby(['air_store_id', 'air_count'], observed=True)['log1p_visitors'].mean().reset_index()
area_visits = store_visits.groupby('air_count', observed=True)['log1p_visitors'].agg(['mean', 'std']).reset_index()
# Plot 3
p3 = sns.scatterplot(data=area_visits, x='air_count', y='mean',
color='blue',
ax=ax3)
p3.errorbar(x=area_visits['air_count'], y=area_visits['mean'],
yerr=area_visits['std'], fmt='o',
color='blue', capsize=5)
p3.set_xlabel('Air Restaurants per Area')
p3.set_ylabel('Mean Log Visitors')
p3.set_title('Mean Log Visitors vs Air Restaurants per Area')
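Note that the Python translation omits the `geom_smooth(method = "lm")` regression line from the R plot. A minimal numpy sketch of the same ordinary least-squares fit, on hypothetical area-level summary data:

```python
import numpy as np
import pandas as pd

# Hypothetical area-level summary, shaped like the area_visits frame
demo_area = pd.DataFrame({'air_count': [2, 4, 8, 16, 24],
                          'mean': [3.1, 3.0, 2.9, 2.8, 2.7]})

# Least-squares line: a numpy counterpart of geom_smooth(method = "lm")
slope, intercept = np.polyfit(demo_area['air_count'], demo_area['mean'], 1)
fit_line = intercept + slope * demo_area['air_count']
```

The fitted `slope` quantifies the weak downward trend; in the actual figure the line could be drawn with `ax3.plot(area_visits['air_count'], fit_line, color='black')`.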
Whenever we have spatial coordinates in our data we automatically also have the distances between these coordinates. As a first approximation we use the linear distance between two points (i.e. as the crow flies).
To compute these distances we are using the distCosine function of the geosphere package for spherical trigonometry. This method gives us the shortest distance between two points on a spherical earth. For the purpose of this localised analysis we choose to ignore ellipsoidal distortion of the earth’s shape.
As the single reference point for all distances we choose the median latitude and median longitude. Our analysis can be extended to investigate all the pairwise distances between two restaurants with the same methodology. We won’t perform this study in our kernel, but I encourage anyone who wants to try it to see whether additional insight can be gained this way.
Here, we plot the histograms of the distances from the medians for all the restaurants in the air and hpg data. These distributions will suggest a grouping into 5 different bins with ranges of increasing distance. We then compute the mean log1p visitor numbers for the 5 groups and compare them:
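The `dist_group` feature used in the map below is derived elsewhere in the kernel. A minimal Python sketch of the idea, assuming `latitude`/`longitude` columns: the distance formula mirrors the spherical law of cosines used by geosphere's `distCosine` (with its default equatorial radius), the coordinates are made up, and the 5 bins here are equal-width rather than the exact grouping chosen from the histograms:

```python
import numpy as np
import pandas as pd

def dist_cosine(lat1, lon1, lat2, lon2, radius=6378137.0):
    """Great-circle distance in metres via the spherical law of cosines
    (same formula and default radius as geosphere::distCosine)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    cos_d = (np.sin(lat1) * np.sin(lat2) +
             np.cos(lat1) * np.cos(lat2) * np.cos(lon2 - lon1))
    return radius * np.arccos(np.clip(cos_d, -1.0, 1.0))

# Hypothetical store coordinates (made up for illustration)
stores = pd.DataFrame({'latitude': [35.66, 34.70, 43.06, 33.59],
                       'longitude': [139.69, 135.50, 141.35, 130.40]})

# Median point as the single reference, then each store's distance from it
med_lat = stores['latitude'].median()
med_lon = stores['longitude'].median()
stores['dist'] = dist_cosine(stores['latitude'], stores['longitude'],
                             med_lat, med_lon)

# Group the distances into 5 bins of increasing distance (equal-width here)
stores['dist_group'] = pd.cut(stores['dist'], 5, labels=False) + 1
```

The `np.clip` guards against floating-point rounding pushing the cosine slightly outside [-1, 1], which would make `arccos` return NaN for near-coincident points.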
foo <- air_store %>% select(latitude, longitude, dist_group) %>% mutate(dset = "air")
bar <- hpg_store %>% select(latitude, longitude, dist_group) %>% mutate(dset = "hpg")
leaflet(foo) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
addScaleBar() %>%
addCircleMarkers(~longitude, ~latitude, group = "AIR",
color = "blue", fillOpacity = 0.5, radius = 3*foo$dist_group) %>%
addCircleMarkers(lng = bar$longitude, lat = bar$latitude, group = "HPG",
color = "red", fillOpacity = 0.5, radius = 3*bar$dist_group) %>%
addCircleMarkers(lng = med_coord_air$longitude, lat = med_coord_air$latitude, group = "Centre",
color = "darkgreen", fillOpacity = 1) %>%
addLayersControl(
overlayGroups = c("AIR", "HPG", "Centre"),
options = layersControlOptions(collapsed = FALSE)
)
foo = air_store[['latitude', 'longitude', 'dist_group']].assign(dset='air')
bar = hpg_store[['latitude', 'longitude', 'dist_group']].assign(dset='hpg')
# Create a base map
m = folium.Map(location=[foo['latitude'].mean(),
foo['longitude'].mean()], zoom_start=5,
tiles="CartoDB Positron",
attr="<a href=https://endless-sky.github.io/>Endless Sky</a>")
# m = folium.Map(location=[35.647042153196615, 137.41531074957783], zoom_start=10,
# tiles='https://{s}.basemaps.cartocdn.com/rastertiles/voyager_nolabels/{z}/{x}/{y}{r}.png',
# attr='CartoDB.Voyager', max_zoom=20)
# Add tiles
folium.TileLayer('CartoDB positron').add_to(m)
folium.LayerControl().add_to(m)
# Define a function to add markers for hpg_store
# def add_marker(row):
# folium.Marker(location=[row['latitude'], row['longitude']],
# popup=row['hpg_store_id'],
# icon=None,
# tooltip=row['hpg_genre_name']).add_to(m)
# Add markers for hpg_store
hpg_store.apply(lambda row:folium.Marker(
location=[row['latitude'], row['longitude']],
popup=row['hpg_store_id'],icon=None,
tooltip=row['hpg_genre_name']).add_to(m),
axis=1)
# # Define a function to add markers for air_store
# def add_air_marker(row):
# folium.Marker(location=[row['latitude'], row['longitude']],
# popup=row['air_store_id'],
# icon=None,
# tooltip=row['air_genre_name']).add_to(m)
# Add markers for air_store
air_store.apply(lambda row:folium.Marker(
location=[row['latitude'], row['longitude']],
popup=row['air_store_id'],
icon=None,
tooltip=row['air_genre_name']).add_to(m),
axis=1)
# add a marker for med_coord_air
folium.Marker(location=[35.658068, 139.685474],
popup="Center Point",
icon=folium.Icon(color='green'),
tooltip="Center Point").add_to(m)
m  # display the map in a notebook context